Candle is a minimalist machine learning (ML) framework written in Rust, developed by Hugging Face. Its primary purpose is to provide a high-performance, user-friendly environment for running and developing ML models, with a particular focus on enabling serverless inference and removing Python dependencies from production workloads. The framework prioritizes performance, including robust GPU support, and aims to be easy to learn and use, drawing inspiration from PyTorch's syntax.
Candle's core revolves around its `Tensor` struct and the operations defined on it, mirroring the tensor APIs of frameworks such as PyTorch. It offers a streamlined approach to building and deploying ML models, suitable for both research and production. The framework's design also emphasizes lightweight binaries, which speeds up instance creation in cluster environments, a key advantage for serverless deployments.
Candle's main features include a simple syntax that closely resembles PyTorch, making it easier for users familiar with that framework to transition. It supports model training and allows for the integration of custom operations and kernels, such as flash-attention v2, for optimized performance. The framework boasts multiple backends, including an optimized CPU backend with optional MKL support for x86 and Accelerate for macOS, and a CUDA backend for efficient GPU utilization, including multi-GPU distribution via NCCL. Furthermore, Candle offers WASM support, enabling models to run directly in web browsers.
A significant strength of Candle lies in its extensive collection of pre-built models. It supports a wide range of state-of-the-art models, including language models such as LLaMA (v1, v2, and v3), Falcon, StarCoder, Phi, Mamba, Gemma, Mistral, Mixtral, StableLM, Replit-code, BERT, Yi, Qwen, and RWKV. It also provides support for quantized LLMs, text-to-text models (T5, Marian MT), text-to-image models (Stable Diffusion, Wuerstchen), image-to-text models (BLIP, TrOCR), and computer vision models (DINOv2, YOLO, SAM, SegFormer). This broad model support lets users experiment with and deploy a variety of ML tasks without building models from scratch.
Candle supports loading models from various file formats, including safetensors, npz, ggml, and PyTorch files, providing flexibility in model integration. It also offers serverless deployment capabilities on CPU, enabling small and fast deployments. Furthermore, the framework provides quantization support using the llama.cpp quantized types, optimizing memory usage and inference speed.
The repository includes several examples demonstrating how to use Candle for various tasks, including text generation, image recognition, and speech recognition. These examples showcase the framework's ease of use and its ability to run complex models. The documentation also provides a "cheatsheet" comparing Candle operations to their PyTorch equivalents, further aiding users in their transition. Candle also benefits from a growing ecosystem of external resources, including tutorials, LoRA implementations, and libraries for various tasks like video generation, optimizers, and model serving. Overall, Candle is a promising framework for those seeking a performant, Rust-based alternative for ML development and deployment, particularly for serverless and production environments.