dvc by treeverse

Description: 🦉 Data Versioning and ML Experiments

Summary Information

Updated 50 minutes ago

Added to GitGenius on June 4th, 2024

Created on March 4th, 2017

Open Issues & Pull Requests: 188 (+0)

Number of forks: 1,311

Total Stargazers: 15,736 (+0)

Total Subscribers: 130 (+0)

Issue Activity (beta)

Open issues: 162

New in 7 days: 0

Closed in 7 days: 0

Avg open age: 891 days

Stale 30+ days: 156

Stale 90+ days: 145

Recent activity

Opened in 7 days: 0

Closed in 7 days: 0

Comments in 7 days: 0

Events in 7 days: 0

Top labels

bug (898)
enhancement (606)
feature request (453)
p2-medium (447)
p3-nice-to-have (381)
awaiting response (363)
p1-important (361)
A: experiments (313)

Most active issues this week

#11050 `dvc diff <rev>` reports all-deleted/0-added for outputs after splitting a single-dir `.dvc` into per-subfolder `.dvc` files (3.66.1) - 1 event / 1 comment

Repository Insights (GitGenius)

Median issue/PR response: 1099.6 days

Mean response time: 1020.4 days

90th percentile: 2081.2 days

Tracked items: 827

Most active contributors

skshetry - 729 events, 288 issues
shcheklein - 489 events, 119 issues
dberenbaum - 202 events, 91 issues
nablabits - 20 events, 2 issues
anunayasri - 17 events, 6 issues

Related by overlapping contributors

Detailed Description

DVC, or Data Version Control, is a command-line tool and VS Code extension designed to enable reproducible machine learning projects by managing data, models, and experiments alongside code in Git repositories. The project is written in Python and addresses a core challenge in ML development: versioning large data and model files that are impractical to store directly in Git while maintaining reproducibility and collaboration capabilities.

The tool functions as a Git-like system for data artifacts, allowing users to store and share data and models in cloud storage or on-premise networks while keeping version metadata in Git. This approach separates the concerns of code versioning through Git and data versioning through DVC's caching and remote storage system. DVC supports multiple remote storage backends including AWS S3, Azure, Google Cloud Storage, and SSH-accessible network storage, making it flexible for different infrastructure setups.
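The split between large files in a cache and small metadata in Git can be illustrated with a minimal sketch. This is not DVC's implementation or file format: the `track` function, the `.pointer.json` suffix, the cache layout, and the choice of SHA-256 are all assumptions made for illustration. The idea is only that the data is stored under its content hash, while a tiny pointer file (the part that would be committed to Git) records which hash belongs to which path.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a pointer file.

    Hypothetical helper: the pointer is small enough to commit to Git,
    while the data itself lives in cache_dir (or a remote mirror of it).
    """
    digest = file_sha256(data_file)
    # Store the content under its own hash so identical files share one copy.
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_file, cached)

    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(
        json.dumps(
            {
                "path": data_file.name,
                "sha256": digest,
                "size": data_file.stat().st_size,
            },
            indent=2,
        )
    )
    return pointer
```

Under these assumptions, checking out an older commit restores an older pointer file, and the matching data can be fetched from the cache by its hash.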

A central feature of DVC is its pipeline system, which functions similarly to Makefiles for machine learning. Pipelines define computational graphs that connect code and data together, specifying input dependencies, commands to execute, and outputs to preserve. This allows users to version their data processing and model training workflows in Git while ensuring that only impacted pipeline steps re-run when changes occur, enabling fast iteration during development.
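The "only impacted steps re-run" behavior can be sketched as a small staleness check over a stage graph. This is an illustrative sketch, not DVC's scheduler: the `Stage` class, the `fingerprint` and `stages_to_run` functions, and the `lock` dictionary are hypothetical names, and the stages are assumed to be supplied in dependency order. A stage is re-run when the fingerprint of its dependencies differs from the recorded one, or when an upstream stage that produces one of its dependencies is itself being re-run.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Stage:
    name: str
    deps: List[str]  # files the stage reads (code and data)
    outs: List[str]  # files the stage writes


def fingerprint(stage: Stage, contents: Dict[str, bytes]) -> str:
    """Hash the names and current contents of a stage's dependencies."""
    digest = hashlib.sha256()
    for dep in sorted(stage.deps):
        digest.update(dep.encode())
        digest.update(contents[dep])
    return digest.hexdigest()


def stages_to_run(
    stages: List[Stage], contents: Dict[str, bytes], lock: Dict[str, str]
) -> List[str]:
    """Return the stages that must re-run, given stages in dependency order.

    `lock` maps stage name to the fingerprint recorded after its last run.
    """
    stale_outputs = set()
    to_run = []
    for stage in stages:
        changed = lock.get(stage.name) != fingerprint(stage, contents)
        upstream_stale = any(dep in stale_outputs for dep in stage.deps)
        if changed or upstream_stale:
            to_run.append(stage.name)
            # Everything this stage writes is about to change too.
            stale_outputs.update(stage.outs)
    return to_run
```

In this sketch, editing only a training script marks the training stage and everything downstream of it as stale, while an upstream data-preparation stage is skipped.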

DVC's experiment tracking capabilities allow developers to prepare and run multiple experiments locally without requiring external servers. Experiments can be compared based on hyperparameters and metrics, with results visualized through plots. The system integrates with existing Git hosting platforms like GitHub and GitLab, enabling collaboration through standard Git workflows rather than proprietary experiment management infrastructure.
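Comparing experiments by hyperparameters and metrics amounts to ranking runs on a chosen metric and diffing their parameters. The sketch below assumes each experiment is a plain dictionary with `name`, `params`, and `metrics` keys; that structure and the `rank_experiments` and `param_diff` helpers are hypothetical and do not reflect DVC's internal data model or CLI output.

```python
from typing import Any, Dict, List, Tuple


def rank_experiments(
    experiments: List[Dict[str, Any]], metric: str, higher_is_better: bool = True
) -> List[Dict[str, Any]]:
    """Return experiments sorted from best to worst on one metric."""
    return sorted(
        experiments,
        key=lambda exp: exp["metrics"][metric],
        reverse=higher_is_better,
    )


def param_diff(
    base: Dict[str, Any], other: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {param: (base_value, other_value)} for params that differ."""
    keys = sorted(set(base) | set(other))
    return {
        key: (base.get(key), other.get(key))
        for key in keys
        if base.get(key) != other.get(key)
    }
```

Ranking first and then diffing the best run against a baseline shows which hyperparameter changes accompanied the metric change.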

The repository has a long issue history, but the tracked data points to slow responses and little recent activity. GitGenius tracking data indicates a median issue and pull request response latency of 26,391 hours (about 1,100 days) with a mean of 24,488 hours (about 1,020 days) across 827 tracked items, and no issue events were recorded in the last 7 days. The most active contributors tracked include skshetry with 729 events, shcheklein with 489 events, and dberenbaum with 202 events. In the same data, bug is the most common issue label with 223 occurrences, followed by the priority labels p2-medium with 124 and p1-important with 114, which indicates that issues are triaged by priority.

The project maintains broad platform support through multiple installation methods including pip, conda, snap, Homebrew, Chocolatey, and platform-specific packages for Linux, Windows, and macOS. Optional dependencies for specific cloud storage backends can be installed as needed, allowing users to customize their installation based on their infrastructure requirements.

DVC's integration with VS Code provides a graphical interface for experiment tracking and data management directly within the IDE, with additional features planned for future releases. GitGenius links this repository to microsoft/vscode, microsoft/TypeScript, and rust-lang/rust through overlapping contributor networks, which suggests that some DVC contributors also work on large, widely used open-source projects.
