koalas by databricks

Description: Koalas: pandas API on Apache Spark

Summary Information

Updated 55 minutes ago

Added to GitGenius on January 3rd, 2025

Created on January 3rd, 2019

Open Issues & Pull Requests: 107 (+0)

Number of forks: 369

Total Stargazers: 3,372 (+0)

Total Subscribers: 310 (+0)

Issue Activity (beta)

Open issues: 99

New in 7 days: 0

Closed in 7 days: 0

Avg open age: 1,865 days

Stale 30+ days: 99

Stale 90+ days: 99

Recent activity

Opened in 7 days: 0

Closed in 7 days: 0

Comments in 7 days: 0

Events in 7 days: 0

Top labels

enhancement (189)
bug (71)
question (45)
discussions (36)
help wanted (28)
good first issue (26)
not a koalas issue (26)
P0 (22)

Most active issues this week

No issue events were indexed in the last 7 days.

Repository Insights (GitGenius)

Median issue/PR response: 1088.7 days

Mean response time: 1141.0 days

90th percentile: 2076.8 days

Tracked items: 8

Most active contributors

HyukjinKwon - 4 events, 4 issues
itholic - 4 events, 3 issues
SchabiDesigns - 1 event, 1 issue
mishuhaque - 1 event, 1 issue

Detailed Description

Koalas is a Python library developed by Databricks that implements the pandas DataFrame API on top of Apache Spark, enabling data scientists to write code using familiar pandas syntax while leveraging Spark's distributed computing capabilities for large-scale data processing. The project addresses a key productivity gap by allowing developers already proficient with pandas to work with big data without requiring them to learn Spark's native API or rewrite their code.

The core value proposition of Koalas centers on code portability and a reduced learning curve. Users can maintain a single codebase that works both with pandas (for tests and smaller datasets) and with Spark (for distributed, production-scale datasets). This reduces the friction typically encountered when moving from single-node data analysis to distributed processing, because the API surface stays largely consistent across both environments.
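The single-codebase idea can be sketched as follows. This is a minimal illustration, not code from the Koalas project: the function name mean_by_key and the sample data are hypothetical, and only the pandas path is executed here. The commented lines show how the same function would be called with a Koalas DataFrame, assuming Koalas is installed and a Spark session is available.

```python
import pandas as pd


def mean_by_key(df, key, value):
    """Average `value` per `key` using only pandas-style calls.

    The function never names pandas itself, so any object that
    implements the same DataFrame API can be passed in.
    """
    return df.groupby(key)[value].mean()


# Small in-memory dataset, as one would use in a unit test.
pdf = pd.DataFrame({"city": ["a", "a", "b"], "sales": [10.0, 20.0, 5.0]})
result = mean_by_key(pdf, "city", "sales")
print(result)

# With Koalas on Spark 3.1 or earlier (assumption: Koalas is installed),
# the same function is intended to accept a Koalas DataFrame:
#   import databricks.koalas as ks
#   kdf = ks.from_pandas(pdf)
#   mean_by_key(kdf, "city", "sales")
```

Keeping the logic in functions that receive the DataFrame as an argument is one way to let tests run on pandas while production jobs pass in a distributed DataFrame.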

Koalas was designed with Apache Spark 3.1 and earlier versions in mind, supporting integration with Spark's distributed processing engine while maintaining pandas compatibility. The library leverages Apache Arrow for efficient data interchange between Python and Spark. Installation is straightforward through standard Python package managers like Conda and pip, with the added convenience of being pre-installed in Databricks Runtime 7.1 and above for users on the Databricks platform.

According to GitGenius activity tracking, the repository's median issue and pull request response latency is 1,088.7 days (approximately 26,129 hours), with a mean of 1,141.0 days (approximately 27,383 hours). The most active contributors tracked are HyukjinKwon and itholic, each with four recorded events, while SchabiDesigns and mishuhaque each contributed one event. Enhancement and help-wanted labels are reported with four and two instances respectively, figures that appear to cover only the eight tracked items; the repository-wide label counts listed above are led by enhancement (189) and bug (71). The repository's contributor network overlaps with significant projects including pandas-dev/pandas, microsoft/vscode, and microsoft/TypeScript, indicating cross-pollination with the broader Python data science and software development ecosystems.
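The hour figures quoted here follow from the day figures in the Repository Insights listing. A small sketch of the conversion is shown below; the helper name days_to_hours is hypothetical. The mean comes out one hour above the quoted 27,383, presumably because the displayed day value is itself rounded.

```python
HOURS_PER_DAY = 24


def days_to_hours(days: float) -> int:
    """Convert a latency in days to whole hours, rounding to the nearest hour."""
    return round(days * HOURS_PER_DAY)


# Day values copied from the Repository Insights listing.
median_hours = days_to_hours(1088.7)
mean_hours = days_to_hours(1141.0)
p90_hours = days_to_hours(2076.8)

print(median_hours, mean_hours, p90_hours)
```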

However, the project's status has fundamentally changed. The README explicitly states that Koalas is now deprecated and in maintenance mode, because its functionality has been officially incorporated into PySpark, as the pandas API on Spark (the pyspark.pandas module), beginning with Apache Spark 3.2. Users on Apache Spark 3.2 and later are directed to use PySpark directly rather than maintain a separate Koalas dependency. The standalone library is therefore obsolete for modern Spark deployments, while remaining useful for legacy systems still running Spark 3.1 and earlier.
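The version cutoff described above can be expressed as a small helper. This is a sketch under stated assumptions: the function name pandas_api_module is hypothetical, and the version string is assumed to have the form major.minor[.patch]. The two module names are the import paths used by Koalas (databricks.koalas) and by the pandas API on Spark (pyspark.pandas).

```python
def pandas_api_module(spark_version: str) -> str:
    """Return the module that provides the pandas API for a Spark version.

    Spark 3.2 and later ship the API inside PySpark itself; earlier
    versions rely on the separate Koalas package.
    """
    # Compare only the major and minor components, e.g. "3.1.2" -> (3, 1).
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    if (major, minor) >= (3, 2):
        return "pyspark.pandas"
    return "databricks.koalas"


for version in ("3.0.1", "3.1.2", "3.2.0", "3.5.1"):
    print(version, "->", pandas_api_module(version))
```

In a real environment the returned name could be passed to importlib.import_module, with pyspark.__version__ as the input, so that calling code uses one alias regardless of the Spark version.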

The project provides comprehensive documentation including getting started guides, a ten-minute interactive tutorial available through Jupyter notebooks, contribution guidelines, design principles, frequently asked questions, and best practices documentation. These resources support both new users evaluating the library and contributors interested in extending its functionality.
