The GitHub repository [https://github.com/neuralmagic/autofp8](https://github.com/neuralmagic/autofp8) hosts AutoFP8, a library for automatically converting large language models to 8-bit floating-point (FP8) precision with minimal accuracy loss. This matters for AI hardware acceleration: FP8 roughly halves a model's memory footprint and bandwidth requirements relative to the 16-bit formats it is typically served in, and cuts power consumption accordingly. The core of the approach is per-tensor scaled quantization: each weight tensor, and optionally each activation tensor, is mapped into the narrow dynamic range of the FP8 E4M3 format using a scaling factor chosen to fit that tensor's observed values.
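To make the numerics concrete, the sketch below simulates per-tensor E4M3 quantization in plain Python. The helper names are illustrative, not AutoFP8's API, and the real library operates on GPU tensors rather than Python lists:

```python
import math

E4M3_MAX = 448.0  # largest finite value in the OCP FP8 E4M3 format

def round_to_e4m3(x: float) -> float:
    """Round x to the nearest representable E4M3 value
    (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    a = min(abs(x), E4M3_MAX)              # saturate instead of overflowing
    e = max(math.floor(math.log2(a)), -6)  # binade of a, clamped at the subnormal range
    step = 2.0 ** (e - 3)                  # spacing between adjacent E4M3 values near a
    return sign * min(round(a / step) * step, E4M3_MAX)

def quantize_dequantize(values, scale=None):
    """Per-tensor symmetric FP8: divide by a scale so the largest magnitude
    maps near 448, round each entry to E4M3, then multiply the scale back in."""
    if scale is None:
        scale = max(abs(v) for v in values) / E4M3_MAX
    return [round_to_e4m3(v / scale) * scale for v in values]

weights = [0.1, -2.5, 0.003, 1.7]
approx = quantize_dequantize(weights)  # each entry within a few percent of the original
```

Running the round trip on a small tensor shows the characteristic behavior: the largest-magnitude entry is reproduced almost exactly (it defines the scale), while the others pick up a rounding error of a few percent from E4M3's 3-bit mantissa.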
Weights are the easy part of FP8 conversion, since their values are known in advance; activations are harder because their ranges vary with the input. AutoFP8 therefore offers two activation schemes. In the static scheme, a calibration pass over sample data records activation ranges and fixes one scale per tensor ahead of time, which gives the fastest inference. In the dynamic scheme, scales are computed from each input at runtime, trading some speed for robustness when activation ranges are difficult to calibrate in advance. Particularly sensitive layers can also be excluded from quantization entirely and left at higher precision, minimizing accuracy loss.
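The difference between the two schemes can be sketched in a few lines, assuming simple per-tensor max-abs scaling; the function names are illustrative rather than AutoFP8's actual API:

```python
E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def static_scale(calibration_batches):
    """Static scheme: fix one activation scale ahead of time from the
    largest magnitude observed across all calibration batches."""
    amax = max(abs(v) for batch in calibration_batches for v in batch)
    return amax / E4M3_MAX

def dynamic_scale(batch):
    """Dynamic scheme: recompute the scale from each batch at runtime."""
    return max(abs(v) for v in batch) / E4M3_MAX

calibration = [[-3.0, 1.0], [2.0, 0.5]]
s_static = static_scale(calibration)      # fixed once, reused for every input
s_dynamic = dynamic_scale([0.2, -0.1])    # tighter fit to this batch, but costs
                                          # an extra pass over the data per step
```

A small-range input yields a much smaller dynamic scale than the static one, preserving more resolution for that batch, which is exactly the accuracy-versus-latency trade-off between the two schemes.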
The repository provides a compact set of tools for applying AutoFP8. It centers on a Python library built on top of Hugging Face Transformers, along with example scripts and pre-quantized checkpoints demonstrating its effectiveness. The library is designed to slot into existing workflows: a pre-trained FP16/FP32 checkpoint goes in, and an FP8 checkpoint comes out in a format that the vLLM inference engine can load directly. The gains are largest on GPUs with native FP8 support, such as NVIDIA's Ada Lovelace and Hopper generations.
Key components of the repository include a quantization configuration class (`BaseQuantizeConfig`, which selects the activation scheme and any layers to leave unquantized), a model wrapper (`AutoFP8ForCausalLM`) that runs calibration and performs the conversion, and utilities for saving the quantized checkpoint. The calibration step feeds sample prompts through the model to determine per-tensor scaling factors; the conversion step then rewrites the weights, and where applicable the activation scales, in FP8. The repository also contains documentation and examples covering popular LLM architectures such as Llama and Mistral, with benchmarks reporting roughly halved weight memory and improved serving throughput compared with 16-bit baselines under vLLM.
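A toy version of the converter step, assuming per-tensor symmetric scaling and a hypothetical layer name; the real tool operates on Hugging Face checkpoints and handles the actual 8-bit encoding and serialization, both omitted here:

```python
E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def convert_checkpoint(fp32_layers):
    """Minimal converter sketch: for each layer, compute one symmetric
    per-tensor scale and map the weights into the E4M3 range. The scale
    is kept alongside the tensor so inference can dequantize."""
    converted = {}
    for name, weights in fp32_layers.items():
        scale = max(abs(w) for w in weights) / E4M3_MAX
        converted[name] = {
            "scale": scale,                              # stored in full precision
            "weight_fp8": [w / scale for w in weights],  # magnitudes now in [0, 448]
        }
    return converted

ckpt = convert_checkpoint({"mlp.down_proj": [0.5, -1.0, 0.25]})
```

For this input, the per-tensor scale comes out near 1/448 and the scaled weights land at roughly [224, -448, 112], i.e. the largest-magnitude weight is pinned to the top of the E4M3 range.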
Furthermore, the repository encourages community contributions and sketches a roadmap that includes support for more model architectures and improved calibration techniques. (Active development of FP8 quantization has since moved to Neural Magic's broader llm-compressor project, to which the AutoFP8 README now points.) The project's value rests on delivering real performance gains while keeping accuracy high, making it a useful step toward serving large models efficiently on resource-constrained hardware. The GitHub repository remains the reference point for this line of work.