bigcode-evaluation-harness
by
bigcode-project

Description: A framework for the evaluation of autoregressive code generation language models.


Summary Information

Updated 42 minutes ago
Added to GitGenius on February 29th, 2024
Created on August 9th, 2022
Open Issues/Pull Requests: 96 (+0)
Number of forks: 254
Total Stargazers: 1,019 (+0)
Total Subscribers: 10 (+0)
Detailed Description

The `bigcode-evaluation-harness` repository provides a framework for evaluating code generation models, particularly those developed by the BigCode project. The harness enables systematic, standardized evaluation of these models' capabilities across a range of programming tasks, offering a common set of tools and metrics for benchmarking performance.

The core of the repository is its ability to evaluate language models on diverse coding tasks with minimal setup. The harness supports benchmarks in multiple programming languages, including Python and JavaScript, giving it broad applicability across domains. Evaluation follows a structured approach: each task defines prompts for the model to complete, together with test cases that allow the model's output to be assessed automatically.
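The execution-based scoring this describes can be sketched in a few lines. This is a simplified illustration, not the harness's actual API: a candidate completion is run in a namespace, then the task's assertions are run against it. (A real harness sandboxes this step, since model-generated code is untrusted.)

```python
def run_candidate(candidate_src: str, test_src: str) -> bool:
    """Execute a candidate solution, then its test cases, in one namespace.

    Returns True if every assertion passes. A production harness would
    isolate this in a subprocess with timeouts and no network access,
    because model-generated code is untrusted.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # define the generated function
        exec(test_src, namespace)        # run the task's assertions
        return True
    except Exception:
        return False

candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(run_candidate(candidate, tests))  # → True
```

A completion that defines `add` incorrectly would raise an `AssertionError` inside `test_src` and score as a failure.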

A key feature of the `bigcode-evaluation-harness` is its emphasis on reproducibility and consistency in evaluations. The framework includes well-defined datasets and metrics, which are crucial for comparing different models objectively. This standardization is vital as it enables researchers to benchmark their models against established baselines and track improvements over time.
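A standard metric for this kind of execution-based benchmarking is pass@k: the probability that at least one of k sampled completions passes the tests. The unbiased estimator from Chen et al. (2021) is short enough to show in full; the function below is a generic implementation of that published formula, not code taken from this repository.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled per problem
    c: completions that passed the tests
    k: evaluation budget
    Formula: 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer failures than the budget: some passing sample is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))  # → 0.25
```

Reporting pass@k over a fixed dataset of problems is what makes results comparable across models and over time.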

The repository includes a variety of tasks that test different aspects of code generation, ranging from simple completion challenges to problems that require generating entire functions or scripts. The harness scores the functional correctness of the generated code by executing it against each task's test cases, which is more robust than surface-level comparison to a reference solution.

Moreover, the evaluation framework is designed with extensibility in mind, allowing users to add new tasks and metrics easily. This flexibility ensures that the harness remains relevant as coding paradigms evolve and new challenges emerge within the field of automated code generation.
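The extensibility described above usually means implementing a small task interface. The class and method names below are illustrative only, not the harness's actual base class; they show the shape such an interface tends to take: a prompt builder, a reference answer, and a scoring rule.

```python
from abc import ABC, abstractmethod

class EvalTask(ABC):
    """Hypothetical task interface; names are illustrative, not the real API."""

    @abstractmethod
    def get_prompt(self, doc: dict) -> str: ...

    @abstractmethod
    def get_reference(self, doc: dict) -> str: ...

    @abstractmethod
    def check(self, generation: str, reference: str) -> bool: ...

class ReverseStringTask(EvalTask):
    """Toy task: the model must produce the reversal of a given string."""

    def get_prompt(self, doc: dict) -> str:
        return f"# Reverse the string {doc['s']!r}\n"

    def get_reference(self, doc: dict) -> str:
        return doc["s"][::-1]

    def check(self, generation: str, reference: str) -> bool:
        # Exact-match scoring; execution-based tasks would run tests instead.
        return generation.strip() == reference

task = ReverseStringTask()
doc = {"s": "abc"}
print(task.check("cba", task.get_reference(doc)))  # → True
```

Registering a new benchmark then amounts to writing one such class and pointing it at a dataset, leaving the generation and scoring loop untouched.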

In addition to aggregate scores, the `bigcode-evaluation-harness` lets users save model generations and per-task results for later analysis. Inspecting raw outputs makes it possible to debug erroneous generations and to see how changes in the prompt affect the generated code, insights that are valuable for researchers refining model architectures or training procedures.

Overall, the `bigcode-evaluation-harness` serves as an essential tool for developers and researchers working with machine learning models aimed at automating coding tasks. By offering a robust framework for evaluation, it not only helps in measuring current performance but also guides future developments in code generation technologies.
