Description: A pure-rust port of tiktoken.
View anysphere/tiktoken-rs on GitHub
tiktoken-rs is a Rust implementation of OpenAI's tiktoken library, the fast BPE (Byte Pair Encoding) tokenizer used by models such as GPT-3 and GPT-4. The primary goal of this crate is to provide a performant and accurate Rust equivalent, allowing developers to compute exact token counts without relying on Python or external processes. This is particularly valuable where latency and resource usage are critical, such as in APIs, cost-estimation tools, or pre-processing pipelines for large language model (LLM) interactions.
The core functionality revolves around loading and utilizing various tiktoken encodings. The library supports loading encodings directly from the JSON files OpenAI provides (e.g., `cl100k_base.json`, `r50k_base.json`), as well as handling special cases like GPT-2 encodings. It provides functions to encode text strings into token IDs and decode token IDs back into strings. Crucially, it accurately replicates the tokenization behavior of the original Python tiktoken library, ensuring compatibility and predictable results when interacting with OpenAI models. The crate doesn't just focus on basic encoding/decoding; it also includes features to handle mergeable ranks, special tokens (like `<|endoftext|>`), and unknown tokens.
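The merge loop at the heart of BPE can be sketched in a few lines. The snippet below is a simplified, self-contained illustration of how mergeable ranks drive tokenization; the rank table is a toy and this is not the crate's actual implementation, which operates on the full byte-level rank table loaded from the encoding file:

```rust
use std::collections::HashMap;

// Toy rank table for illustration only; a real encoding maps on the
// order of 100k byte sequences to ranks.
fn toy_ranks() -> HashMap<Vec<u8>, u32> {
    let mut ranks = HashMap::new();
    ranks.insert(b"a".to_vec(), 0);
    ranks.insert(b"b".to_vec(), 1);
    ranks.insert(b"c".to_vec(), 2);
    ranks.insert(b"ab".to_vec(), 3);
    ranks.insert(b"abc".to_vec(), 4);
    ranks
}

// Greedy BPE: repeatedly merge the adjacent pair whose concatenation
// has the lowest rank, until no mergeable pair remains.
fn byte_pair_encode(piece: &[u8], ranks: &HashMap<Vec<u8>, u32>) -> Vec<u32> {
    if piece.is_empty() {
        return Vec::new();
    }
    // Start with one part per byte.
    let mut parts: Vec<Vec<u8>> = piece.iter().map(|&b| vec![b]).collect();
    loop {
        // Find the adjacent pair with the lowest merge rank.
        let mut best: Option<(usize, u32)> = None;
        for i in 0..parts.len() - 1 {
            let mut cand = parts[i].clone();
            cand.extend_from_slice(&parts[i + 1]);
            if let Some(&rank) = ranks.get(&cand) {
                if best.map_or(true, |(_, r)| rank < r) {
                    best = Some((i, rank));
                }
            }
        }
        match best {
            Some((i, _)) => {
                // Merge parts[i] with parts[i + 1].
                let right = parts.remove(i + 1);
                parts[i].extend_from_slice(&right);
            }
            None => break,
        }
    }
    // Every remaining part is a known token; look up its ID.
    parts.iter().map(|p| ranks[p.as_slice()]).collect()
}

fn main() {
    let ranks = toy_ranks();
    println!("{:?}", byte_pair_encode(b"abc", &ranks)); // [4]
    println!("{:?}", byte_pair_encode(b"cab", &ranks)); // [2, 3]
}
```

Because merges are always applied in rank order, the result is deterministic: `b"abc"` collapses through `ab` to the single token `abc`, while `b"cab"` has no merge for `ca`, so only `ab` is merged.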
A key feature is the `Encoding` struct, which encapsulates the tokenizer's vocabulary and rules. The `Encoding::from_file()` function is the primary way to load an encoding from a JSON file. Once loaded, the `Encoding` provides methods such as `encode()`, `encode_normal()`, and `decode()` for tokenization and detokenization. The `encode_normal()` method behaves like Python tiktoken's `Encoding.encode_ordinary()` method, tokenizing the input as plain text so that special-token strings receive no special treatment. The crate also offers methods for querying the vocabulary size and retrieving the special tokens defined in the encoding.
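To make the normal-versus-special distinction concrete, here is a deliberately tiny stand-in: the `ToyEncoding` type, its word-level vocabulary, and both method bodies are invented for illustration and are not the crate's real data structures. The `encode` path carves out special-token strings as single IDs, while `encode_normal` tokenizes everything as plain text:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for an encoding: a word-level vocabulary plus
// a table of special tokens such as "<|endoftext|>".
struct ToyEncoding {
    vocab: HashMap<String, u32>,
    special: HashMap<String, u32>,
}

impl ToyEncoding {
    fn demo() -> Self {
        let vocab = HashMap::from([
            ("hello".to_string(), 0),
            ("world".to_string(), 1),
        ]);
        let special = HashMap::from([("<|endoftext|>".to_string(), 100)]);
        ToyEncoding { vocab, special }
    }

    // "Normal" path: tokenize the text as-is; special-token strings are
    // never matched here.
    fn encode_normal(&self, text: &str) -> Vec<u32> {
        text.split_whitespace().map(|w| self.vocab[w]).collect()
    }

    // Full path: split out special-token strings first, then encode the
    // plain-text segments between them.
    fn encode(&self, text: &str) -> Vec<u32> {
        let mut out = Vec::new();
        let mut rest = text;
        loop {
            // Earliest special-token occurrence in the remaining text.
            let hit = self
                .special
                .iter()
                .filter_map(|(s, &id)| {
                    rest.find(s.as_str()).map(|pos| (pos, s.len(), id))
                })
                .min();
            match hit {
                Some((pos, len, id)) => {
                    out.extend(self.encode_normal(&rest[..pos]));
                    out.push(id);
                    rest = &rest[pos + len..];
                }
                None => {
                    out.extend(self.encode_normal(rest));
                    break;
                }
            }
        }
        out
    }
}

fn main() {
    let enc = ToyEncoding::demo();
    println!("{:?}", enc.encode_normal("hello world")); // [0, 1]
    println!("{:?}", enc.encode("hello <|endoftext|> world")); // [0, 100, 1]
}
```

The design point this sketches is that special-token handling is a pre-pass over the text: only the segments between special tokens ever reach the ordinary BPE machinery.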
Performance is a significant focus of tiktoken-rs. The implementation achieves tokenization speeds comparable to, and in some cases exceeding, the original tiktoken package (whose own hot path is a Rust core behind Python bindings), while avoiding Python interpreter and binding overhead entirely. This is achieved through careful data-structure design and an optimized BPE merge loop; the crate is also designed to be memory-efficient, minimizing allocations and copying of data. Benchmarks are included in the repository to demonstrate the performance characteristics of the library.
The repository includes comprehensive documentation, examples, and tests. The examples demonstrate how to load encodings, encode and decode text, and handle different tokenization scenarios. The tests cover a wide range of cases, ensuring the accuracy and robustness of the implementation. The project is actively maintained and welcomes contributions from the community. It's a valuable resource for Rust developers working with OpenAI models or needing a fast and reliable BPE tokenizer. The crate is available on crates.io, making it easy to integrate into Rust projects using Cargo.
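Integration is the usual one-line Cargo dependency; the version number below is illustrative only, so check crates.io for the current release:

```toml
[dependencies]
# Version is illustrative; check crates.io for the latest release.
tiktoken-rs = "0.5"
```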