bigcode-dataset
by
bigcode-project

Description: No description available.

Summary Information

Updated 2 hours ago
Added to GitGenius on February 29th, 2024
Created on October 31st, 2022
Open Issues/Pull Requests: 16 (+0)
Number of forks: 82
Total Stargazers: 490 (+0)
Total Subscribers: 10 (+0)
Detailed Description

The BigCode dataset, hosted on GitHub at [https://github.com/bigcode-project/bigcode-dataset](https://github.com/bigcode-project/bigcode-dataset), is a large, carefully curated collection of code generated by large language models (LLMs). A cornerstone project within the BigCode initiative, it aims to give researchers and developers a robust, diverse resource for studying the capabilities and limitations of these models. Its primary goal is to move beyond the often biased and narrow outputs of individual LLMs, offering a more representative and reliable source for training and evaluation.

At its core, the dataset comprises over 230,000 code samples across 17 programming languages: Python, JavaScript, Java, C++, C#, Go, TypeScript, Ruby, PHP, Rust, Swift, Kotlin, Dart, Lua, Shell, SQL, and HTML/CSS. Crucially, the dataset is not a random collection; it is the result of a structured, controlled generation process. The BigCode team employed a multi-stage approach: a set of carefully crafted prompts, designed to elicit diverse coding tasks, was fed into a suite of LLMs (primarily CodeGen, with outputs from other models as well) to generate candidate code. This iterative process, combined with human review and filtering, produced a dataset of markedly higher quality than many existing LLM-generated code repositories.
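The multi-stage pipeline described above (prompt → model generation → automatic filtering → human review) can be sketched roughly as follows. This is an illustrative sketch only: the `Sample` fields, the syntax-check filter, and the review callback are assumptions for exposition, not the project's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    """One generated candidate: the prompt, the model that answered it,
    and the code it produced. Field names are hypothetical."""
    prompt: str
    model: str
    code: str


def passes_static_checks(sample: Sample) -> bool:
    """Automatic filtering stage. As a stand-in quality check for a
    Python sample, verify that the snippet at least parses."""
    try:
        compile(sample.code, "<sample>", "exec")
        return True
    except SyntaxError:
        return False


def run_pipeline(raw: Iterable[Sample],
                 reviewed_ok: Callable[[Sample], bool]) -> List[Sample]:
    """Keep only samples that survive both the automatic filter and
    the (here simulated) human-review stage."""
    return [s for s in raw if passes_static_checks(s) and reviewed_ok(s)]


samples = [
    Sample("reverse a list", "model-a", "def rev(xs):\n    return xs[::-1]"),
    Sample("broken output", "model-b", "def rev(xs) return xs[::-1]"),  # syntax error
]
kept = run_pipeline(samples, reviewed_ok=lambda s: True)
```

In a real pipeline the review callback would be replaced by recorded human annotations, but the shape of the process, generate broadly and then filter aggressively, is the same.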

One of the key innovations of the BigCode dataset is its detailed metadata. Each code sample carries rich contextual information: the prompt used to generate it, the LLM that produced it, the code’s complexity (measured with metrics such as cyclomatic complexity), and a human-annotated assessment of its correctness and quality. This metadata is vital for researchers, letting them analyze the relationship between prompt design, model performance, and code quality. Furthermore, the dataset includes a ‘gold standard’ set of correct solutions for many of the generated samples, facilitating rigorous evaluation and benchmarking.
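A record with that metadata might be modeled as below. The field names (`cyclomatic_complexity`, `human_quality_score`, `gold_solution`) are assumptions chosen to mirror the description above, not the dataset's published schema; the filter shows one typical use, carving out a benchmark subset from the annotations.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CodeSampleRecord:
    """One dataset record; field names are hypothetical and chosen to
    mirror the metadata described in the text."""
    code: str
    prompt: str
    model: str
    cyclomatic_complexity: int
    human_quality_score: float    # e.g. 0.0 (poor) .. 1.0 (correct and clean)
    gold_solution: Optional[str]  # reference solution, when one exists


def evaluation_subset(records: List[CodeSampleRecord],
                      max_complexity: int = 10,
                      min_quality: float = 0.8) -> List[CodeSampleRecord]:
    """Select low-complexity, high-quality samples that have a gold
    solution, e.g. to benchmark a code model against references."""
    return [r for r in records
            if r.cyclomatic_complexity <= max_complexity
            and r.human_quality_score >= min_quality
            and r.gold_solution is not None]


records = [
    CodeSampleRecord("def f(x):\n    return x + 1", "increment x",
                     "codegen", 1, 0.95, "def f(x):\n    return x + 1"),
    CodeSampleRecord("while True:\n    pass", "busy loop",
                     "other-model", 15, 0.4, None),
]
subset = evaluation_subset(records)
```

Because complexity and quality scores are stored per sample, the same records can be re-sliced for other studies, such as correlating prompt wording with annotated correctness, without regenerating anything.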

Beyond the raw code, the repository provides comprehensive documentation, including a detailed README outlining the dataset’s creation process, usage guidelines, and a list of contributing resources. The project actively encourages community contributions, with clear instructions for submitting new code samples, prompts, and metadata. The BigCode dataset is designed to be a living resource, constantly evolving as new models are incorporated and as the understanding of LLM-generated code improves. It’s a critical tool for advancing research in areas such as code generation, model evaluation, and the development of more reliable and controllable AI systems. The project’s long-term vision is to foster a collaborative ecosystem around LLM-generated code, ultimately leading to safer and more effective AI-powered coding tools.
