modin by modin-project

Description: Modin: Scale your Pandas workflows by changing a single line of code


Summary Information

Updated 36 minutes ago

Added to GitGenius on January 5th, 2025

Created on June 21st, 2018

Open Issues & Pull Requests: 712 (+0)

Number of forks: 674

Total Stargazers: 10,392 (+0)

Total Subscribers: 109 (+0)

Issue Activity (beta)

Open issues: 226

New in 7 days: 0

Closed in 7 days: 0

Avg open age: 1,185 days

Stale 30+ days: 226

Stale 90+ days: 225

Recent activity

Opened in 7 days: 0

Closed in 7 days: 0

Comments in 7 days: 0

Events in 7 days: 0

Top labels

bug 🦗 (609)
new feature/request 💬 (211)
P2 (163)
HDK (138)
pandas concordance 🐼 (127)
P3 (116)
P1 (112)
Code Quality 💯 (108)

Most active issues this week

No issue events were indexed in the last 7 days.


Repository Insights (GitGenius)

Median issue/PR response: 0.0 hours

Mean response time: 176.5 days

90th percentile: 881.6 days

Tracked items: 376

Most active contributors

sfc-gh-mvashishtha - 265 events, 87 issues
YarShev - 250 events, 134 issues
anmyachev - 195 events, 88 issues
sfc-gh-joshi - 174 events, 52 issues
sfc-gh-jkew - 71 events, 28 issues

Related by overlapping contributors

Detailed Description

Modin is a Python library that serves as a drop-in replacement for pandas and speeds up data analysis workflows by distributing computation across multiple cores and machines. Its core value proposition is simplicity: users scale existing pandas code by changing only the import statement, from "import pandas as pd" to "import modin.pandas as pd", without rewriting their analysis logic or learning distributed computing concepts.
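As a minimal sketch of that one-line change, the snippet below swaps the import and leaves the rest of the code untouched. The sample data and column names are made up for illustration. The try/except fallback to pandas is an addition for this sketch so that it still runs where Modin is not installed; it is not part of Modin's documented usage.

```python
# import pandas as pd            # original import in an existing script
try:
    import modin.pandas as pd    # the single changed line
except ImportError:
    # Assumption for this sketch only: fall back to plain pandas
    # so the example runs even when Modin is not installed.
    import pandas as pd

# Hypothetical sample data; any existing pandas code would go here unchanged.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Oslo", "Lima"],
        "sales": [10, 20, 30, 5],
    }
)

# Same pandas API as before: group rows by city and total the sales.
totals = df.groupby("city")["sales"].sum()
print(totals)
```

Everything after the import is ordinary pandas code, which is the point: the analysis logic does not need to know which library is behind the name pd.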

The library addresses a fundamental limitation of pandas, which is largely single-threaded and becomes prohibitively slow or runs out of memory when processing large datasets. Modin transparently distributes data and computation across available system resources, so the same pandas API can handle larger datasets and use every core of a multi-core system. According to the repository documentation, Modin can deliver speedups of up to 4x on a laptop with 4 physical cores, with greater improvements on larger, gigabyte-scale datasets.

Modin supports multiple compute engines for distributed execution: Ray, Dask, and MPI through unidist. Users can select their preferred engine via the MODIN_ENGINE environment variable, and Modin automatically detects which engines are installed. This flexibility allows users to choose the execution backend that best fits their infrastructure without changing application code. The library works in local mode on single machines, automatically creating and managing a local cluster, as well as on distributed systems.
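The engine selection described above can be sketched with the standard library alone. The helper functions and the engine-to-package mapping below are illustrative assumptions, not Modin's own detection logic; only the MODIN_ENGINE variable name and the three engine names come from the description. The variable should be set before Modin is first imported.

```python
import importlib.util
import os

# Assumed mapping from a MODIN_ENGINE value to the top-level package
# that engine needs. This is a simplification made for this sketch.
ENGINE_PACKAGES = {"ray": "ray", "dask": "dask", "unidist": "unidist"}


def installed_engines(packages=ENGINE_PACKAGES):
    """Return the engine names whose package can be located, without importing it."""
    return [
        name
        for name, package in packages.items()
        if importlib.util.find_spec(package) is not None
    ]


def select_engine(preferred=None, env=os.environ):
    """Write MODIN_ENGINE into env and return the choice, or None if no engine is found."""
    available = installed_engines()
    if preferred is not None and preferred.lower() in available:
        choice = preferred.lower()
    elif available:
        choice = available[0]  # fall back to the first engine that is installed
    else:
        return None  # nothing installed; leave the environment untouched
    env["MODIN_ENGINE"] = choice
    return choice


# Call this before "import modin.pandas as pd" so Modin sees the setting.
engine = select_engine(preferred="dask")
print("installed engines:", installed_engines())
print("selected engine:", engine)
```

Setting the variable from Python is equivalent to exporting it in the shell before starting the interpreter. The application code that follows the import stays the same whichever engine is chosen.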

The pandas API coverage is substantial, with DataFrame operations achieving 90.8 percent coverage and Series operations at 88.05 percent across all supported engines. The library implements common I/O operations including pd.read_csv, pd.read_parquet, pd.read_sql, pd.read_feather, and pd.read_excel. Some operations like pd.read_json remain partially implemented, with the project tracking these gaps as open issues.

GitGenius tracks 376 issues and pull requests for the repository, with a median response latency of 0.0 hours and a mean of 4235.6 hours (about 176.5 days). The gap between the two figures implies that at least half of the tracked items received an immediate response while a minority waited far longer. The most frequently labeled issues are bugs (155 occurrences), new feature requests (65), and P2 priority items (57). The project's core contributors include sfc-gh-mvashishtha with 265 tracked events, YarShev with 250 events, and anmyachev with 195 events. The repository shares overlapping contributors with pandas-dev/pandas, pola-rs/polars, and ray-project/ray, reflecting its ties to the broader data science ecosystem.

The project maintains comprehensive documentation at modin.readthedocs.io and provides community support through Slack, Stack Overflow, and Twitter. Installation is available through PyPI and conda-forge, with conda-forge offering a modin-all package that includes Ray, Dask, and MPI engines. The library is classified across multiple domains including distributed computing, parallel processing, big data analysis, performance improvement, and machine learning support, reflecting its broad applicability to data science workflows requiring scalability and efficiency.
