optimum-quanto
by
huggingface

Description: A pytorch quantization backend for optimum

View huggingface/optimum-quanto on GitHub ↗

Summary Information

Updated 1 hour ago
Added to GitGenius on June 4th, 2024
Created on September 19th, 2023
Open Issues/Pull Requests: 2 (+0)
Number of forks: 83
Total Stargazers: 1,024 (+0)
Total Subscribers: 7 (+0)
Detailed Description

The Hugging Face optimum-quanto repository provides a PyTorch quantization backend for Optimum, Hugging Face's library for optimized model training and inference. Its goal is to reduce the memory footprint and inference cost of models from the Transformers ecosystem by representing weights, and optionally activations, in lower-precision data types. Unlike backends tied to a specific accelerator, quanto works in eager mode, so it supports models that cannot be traced, and quantized models can be placed on any supported device, including CUDA, MPS, and CPU.

At its heart, optimum-quanto implements linear quantization through quantized tensor classes that subclass torch.Tensor, together with quantized counterparts of common modules such as nn.Linear. Weights can be quantized to int8, int4, int2, or float8, and activations to int8 or float8. The workflow is deliberately simple: quantize() annotates a model's eligible modules, an optional calibration pass records activation ranges on representative data, and freeze() converts the float weights to their quantized representation. Because quantized models remain ordinary PyTorch modules, they can be serialized and reloaded through the usual state_dict mechanisms, and the quantization scheme can be changed simply by passing different weight and activation data types.

The repository includes several key components. First, the core quantization API: quantize(), freeze(), a Calibration context manager, and the qint8, qint4, qint2, and qfloat8 data types. Second, the quantized tensor and module implementations, which dispatch common operations such as linear layers to efficient low-precision kernels where available and fall back to dequantize-and-compute otherwise, keeping data movement to a minimum. Third, example scripts demonstrating the workflow end to end, from quantizing small standalone models to quantizing Transformers models for tasks such as text generation.

Beyond the core components, the repository includes documentation and examples, and the backend is integrated with the transformers library, so models can be quantized on the fly as they are loaded. The project is actively maintained by Hugging Face, with ongoing work on benchmarking different weight and activation settings and on faster low-precision kernels. Ultimately, optimum-quanto aims to make memory-efficient inference of large models accessible to a wider range of users and hardware, from data-center GPUs to consumer devices, and its continued development tracks the rapidly evolving landscape of quantization techniques.
