datasets by huggingface

Description: 🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools


Summary Information

Updated 8 minutes ago

Added to GitGenius on November 21st, 2023

Created on March 26th, 2020

Open Issues & Pull Requests: 1,163 (+0)

Number of forks: 3,290

Total Stargazers: 21,698 (+0)

Total Subscribers: 281 (+0)

Issue Activity (beta)

Open issues: 858

New in 7 days: 2

Closed in 7 days: 1

Avg open age: 927 days

Stale 30+ days: 841

Stale 90+ days: 790

Recent activity

Opened in 7 days: 1

Closed in 7 days: 0

Comments in 7 days: 1

Events in 7 days: 3

Top labels

bug (610)
enhancement (474)
dataset request (136)
dataset-viewer (88)
dataset bug (64)
good first issue (52)
documentation (28)
duplicate (28)

Most active issues this week

#8296 [security] Symlink-Following Arbitrary File Write via Archive Extraction in huggingface/datasets - 4 events / 2 comments
#8256 Dataset Viewer fails on TSFile datasets - 2 events / 2 comments
#8300 Dataset spotlight: Helium benchmarks - 2 events / 1 comment
#8293 Support concatenating multiple streaming datasets while preserving the sum of shards - 1 event / 1 comment
#8302 Turkish DIK-style Syntactic Structure Benchmark Dataset - 1 event / 0 comments


Repository Insights (GitGenius)

Median issue/PR response: 30.7 hours

Mean response time: 191.1 days

90th percentile: 802.0 days

Tracked items: 806

Most active contributors

lhoestq - 672 events, 357 issues
albertvillanova - 278 events, 109 issues
ArjunJagdale - 50 events, 33 issues
jbbqqf - 34 events, 34 issues
mariosasko - 33 events, 16 issues


Detailed Description

The Hugging Face Datasets library is a lightweight Python package that serves as the largest hub of ready-to-use datasets for AI models, providing fast and efficient data manipulation tools. The library centers on two core features: one-line dataloaders for public datasets and efficient data preprocessing capabilities. Users can load datasets with simple commands like load_dataset("rajpurkar/squad"), instantly accessing datasets across 467 languages and dialects, including image datasets, audio datasets, text datasets, 3D medical images, video datasets, and agent traces. The preprocessing functionality allows users to prepare data for training and evaluation through simple commands like dataset.map(process_example).

The library natively supports a wide range of file formats, including CSV, JSON, JSONL, Parquet, Arrow, XML, Text, and Webdataset. Multi-modal support is built in, covering text, audio, image, video, PDF, and NIfTI 3D medical data. A streaming mode lets users iterate over data on the fly without downloading entire datasets, and is described as up to 100x faster when using the Xet backend. The Apache Arrow backend provides zero-copy, memory-mapped storage, so datasets are not limited by available RAM. Additional capabilities include smart caching that automatically reuses processed results, interoperability with NumPy, Pandas, Polars, Arrow, PyTorch, TensorFlow, JAX, and Spark, and multiprocessing support for parallel data processing.

The core architecture provides two main dataset classes: Dataset, for in-memory or memory-mapped datasets backed by Apache Arrow with support for indexing and caching, and IterableDataset, for lazy, streamable datasets suited to large-scale out-of-core processing. Multi-split datasets (for example train, test, and validation) are grouped in a DatasetDict or an IterableDatasetDict, respectively. Additional features include built-in FAISS and Elasticsearch index support for similarity search, flexible JSON type support for structured data, and direct read-write access to Hugging Face Storage Buckets for mutable large-scale raw data.

According to GitGenius activity tracking, the repository has a median issue and pull request response latency of 30.7 hours across 806 tracked items, although the mean response time is 191.1 days and the 90th percentile is 802.0 days, which points to a long tail of slow responses. Among tracked items, enhancement requests are the most active issue category with 167 items, followed by bug reports with 63 items and dataset requests with 14 items. The top contributor, lhoestq, has logged 672 events, followed by albertvillanova with 278 events and ArjunJagdale with 50 events. GitGenius also lists overlapping contributors with major projects including microsoft/vscode, microsoft/typescript, and rust-lang/rust, which may reflect contributors who are active across the broader developer ecosystem.

The library is designed to enable community contributions, with detailed documentation for adding new datasets to the Hub. Users can upload datasets through a web browser, Python scripts, or Git-based workflows. The project enforces code standards using Ruff for linting and requires tests and documentation for contributions. The library is distributed under the Apache 2.0 license and follows the Contributor Covenant 2.0 code of conduct. A published research paper and versioned Zenodo DOIs are available for citation and reproducibility.
