datafusion-ballista by apache

Description: Apache DataFusion Ballista Distributed Query Engine

Summary Information

Updated 58 minutes ago

Added to GitGenius on January 3rd, 2025

Created on May 19th, 2022

Open Issues & Pull Requests: 162 (+0)

Number of forks: 298

Total Stargazers: 2,077 (+0)

Total Subscribers: 43 (+0)

Issue Activity (beta)

Open issues: 114

New in 7 days: 10

Closed in 7 days: 5

Avg open age: 807 days

Stale 30+ days: 91

Stale 90+ days: 67

Recent activity

Opened in 7 days: 10

Closed in 7 days: 4

Comments in 7 days: 11

Events in 7 days: 24

Top labels

enhancement (362)
bug (155)
good first issue (37)
help wanted (34)
documentation (11)
TUI (8)
development-process (6)
performance (6)

Most active issues this week

#1943 Shuffle fetch exhausts ephemeral ports at high target_partitions (client connection caching off by default) - 8 events / 4 comments
#1944 Performance snapshot: Ballista vs Spark/Comet for TPC-H @ SF100 - 5 events / 4 comments
#1829 Add support for `DataFrame.cache()` to Ballista - 4 events / 2 comments
#1923 Implement equivalent of Spark History Server - 3 events / 1 comment
#1937 Add suport for `DataFrame.checkpoint()` - 2 events / 0 comments

Repository Insights (GitGenius)

Median issue/PR response: 0.0 hours

Mean response time: 243.7 days

90th percentile: 1080.3 days

Tracked items: 287

Most active contributors

milenkovicm - 659 events, 224 issues
andygrove - 158 events, 84 issues
martin-g - 85 events, 31 issues
killzoner - 27 events, 9 issues
Dandandan - 22 events, 14 issues

Related by overlapping contributors

Detailed Description

Apache DataFusion Ballista is a distributed query execution engine built in Rust that extends Apache DataFusion by enabling parallelized execution of workloads across multiple nodes. The project allows existing DataFusion applications to be distributed with minimal code changes, making it accessible for users who want to scale their query processing without major architectural rewrites.

The Ballista architecture consists of scheduler processes and executor processes that can run as native binaries or Docker containers, with deployment options including Docker Compose and Kubernetes. Clients submit jobs to the scheduler, which coordinates task distribution to executors that report back on task status and completion. The system is designed to handle complex SQL queries including CTEs, joins, and subqueries at scale, though the project documentation acknowledges an ongoing gap between DataFusion and Ballista functionality that the community is actively working to close.
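
The scheduler and executor roles described above can be illustrated with a toy, in-process model. This is only a sketch under stated assumptions: the `Executor` and `Scheduler` classes, the slot counts, and the round-based loop are hypothetical and do not reflect Ballista's actual API or scheduling policy. It only shows the general shape of the flow: a client submits a job, the scheduler hands tasks to executors with free slots, and executors report completion back.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Executor:
    """Hypothetical stand-in for an executor process with a fixed number of task slots."""
    name: str
    slots: int
    running: list = field(default_factory=list)

    def has_free_slot(self) -> bool:
        return len(self.running) < self.slots


class Scheduler:
    """Hypothetical stand-in for a scheduler that queues tasks and assigns them to executors."""

    def __init__(self, executors):
        if sum(ex.slots for ex in executors) <= 0:
            raise ValueError("at least one executor slot is required")
        self.executors = executors
        self.pending = deque()   # tasks waiting for a free slot
        self.completed = {}      # task -> name of the executor that ran it

    def submit_job(self, job_id, num_tasks):
        # A client submits a job; here a job is just a numbered set of tasks.
        for index in range(num_tasks):
            self.pending.append((job_id, index))

    def assign(self):
        # Fill each executor's free slots from the front of the queue.
        for ex in self.executors:
            while self.pending and ex.has_free_slot():
                ex.running.append(self.pending.popleft())

    def collect(self):
        # Executors report their running tasks as finished and free their slots.
        for ex in self.executors:
            for task in ex.running:
                self.completed[task] = ex.name
            ex.running.clear()

    def run(self):
        # Alternate assignment and status collection until the queue drains.
        rounds = 0
        while self.pending:
            self.assign()
            self.collect()
            rounds += 1
        return rounds


def demo():
    scheduler = Scheduler([Executor("executor-1", 2), Executor("executor-2", 2)])
    scheduler.submit_job("job-a", 6)
    rounds = scheduler.run()
    return rounds, scheduler.completed
```

Calling `demo()` runs six tasks across two executors with two slots each. A real deployment would replace the in-process calls with network communication between separate scheduler and executor processes.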

Performance benchmarks derived from TPC-H queries show the project's optimization progress. At scale factor 100 (100 GB of data), on a single node with one executor and eight concurrent tasks, Ballista shows an overall speedup of 2.9x compared to Apache Spark. The speedup varies by query, with some queries gaining substantially more than others, which suggests that optimization efforts have been more effective for certain query patterns.
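
To make the distinction between an overall speedup and per-query speedups concrete, here is a small sketch. The timings are made-up placeholder values, not measurements from the benchmark, and defining the overall speedup as the ratio of total runtimes is an assumption, since the source does not say how its 2.9x figure was computed.

```python
def overall_speedup(baseline_seconds, candidate_seconds):
    """Ratio of total baseline runtime to total candidate runtime (assumed definition)."""
    return sum(baseline_seconds.values()) / sum(candidate_seconds.values())


def per_query_speedup(baseline_seconds, candidate_seconds):
    """Speedup of each query taken individually."""
    return {
        query: baseline_seconds[query] / candidate_seconds[query]
        for query in baseline_seconds
    }


# Hypothetical timings in seconds, for illustration only.
baseline = {"q1": 40.0, "q2": 10.0, "q3": 50.0}
candidate = {"q1": 10.0, "q2": 8.0, "q3": 22.0}

total_ratio = overall_speedup(baseline, candidate)
by_query = per_query_speedup(baseline, candidate)
```

With these placeholder numbers the per-query ratios differ widely from the total-runtime ratio, which is why a single overall figure can hide large variation between queries.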

The codebase is organized into multiple Cargo feature-gated components:

ballista (client crate): a standalone mode feature for in-process scheduler and executor operation
ballista-core: Arrow IPC optimizations for shuffle performance and an optional Spark compatibility mode
ballista-scheduler: optional features including Substrait plan support, Prometheus metrics collection, execution graph visualization, Kubernetes Event Driven Autoscaling integration, REST API endpoints, and stage plan caching control
ballista-executor: the mimalloc memory allocator for performance optimization, alongside Arrow IPC improvements
ballista-cli: a terminal user interface for REST client interactions

GitGenius activity tracking reports a median issue and pull request response latency of 0.0 hours and a mean latency of 5909.8 hours across 284 tracked items. This gap suggests that most items receive a near-immediate first response while a long tail of complex issues waits much longer. Enhancement requests are the most common issue type with 147 tracked items, followed by 83 bug reports and 33 good first issue designations. The primary contributor, milenkovicm, has logged 659 events, with andygrove contributing 148 events and martin-g contributing 85 events. The project shares contributors with apache/datafusion, pingcap/tidb, and nvidia/cudf-spark, indicating cross-pollination with other major data processing systems.
