datafusion by apache

Description: Apache DataFusion SQL Query Engine


Summary Information

Updated 40 minutes ago

Added to GitGenius on November 11th, 2024

Created on April 17th, 2021

Open Issues & Pull Requests: 2,025 (-1)

Number of forks: 2,217

Total Stargazers: 8,961 (+0)

Total Subscribers: 113 (+0)

Issue Activity (beta)

Open issues: 1,641

New in 7 days: 13

Closed in 7 days: 8

Avg open age: 435 days

Stale 30+ days: 1,474

Stale 90+ days: 1,231

Recent activity

Opened in 7 days: 10

Closed in 7 days: 4

Comments in 7 days: 24

Events in 7 days: 74

Top labels

enhancement (3,936)
bug (2,630)
good first issue (657)
help wanted (178)
performance (141)
documentation (118)
regression (83)
development-process (65)

Most active issues this week

#23322 chore: Explore combining `SessionConfig` and `RuntimeConfig` into same framework - 12 events / 2 comments
#19337 Dynamically narrow logical types in Parquet writer - 8 events / 3 comments
#23194 Add AQE to DataFusion - 8 events / 4 comments
#23307 SQL / slt coverage for IN lists - 7 events / 0 comments
#23317 SQL Unparser generates incorrect column references - 7 events / 2 comments


Repository Insights (GitGenius)

Median issue/PR response: 0.0 hours

Mean response time: 169.1 days

90th percentile: 940.1 days

Tracked items: 5,522

Most active contributors

alamb - 8,094 events, 2,798 issues
Jefffrey - 855 events, 478 issues
adriangb - 825 events, 311 issues
jayzhan211 - 777 events, 340 issues
comphead - 748 events, 315 issues


Detailed Description

Apache DataFusion is an extensible query engine written in Rust that leverages Apache Arrow as its in-memory columnar format. The project provides libraries and binaries for developers building fast and feature-rich database and analytic systems, with the ability to customize the engine for particular workloads. The core engine offers SQL and DataFrame APIs, a full query planner, a columnar streaming multi-threaded vectorized execution engine, and support for partitioned data sources. Built-in support includes CSV, Parquet, JSON, and Avro formats, with extensive customization capabilities at nearly all points including data sources, query languages, functions, and custom operators.

The repository is under active development with significant community engagement. According to GitGenius tracking data, the project has processed 5,519 issues and pull requests, with a median response latency of 0.0 hours and a mean latency of 4,061.1 hours (about 169 days). The most frequently applied issue labels are enhancement with 2,682 occurrences, bug with 1,791 occurrences, and good first issue with 429 occurrences, which suggests a balance between feature development and bug fixes, with an explicit pathway for new contributors. The project's top contributor, alamb, has logged 8,094 events, followed by Jefffrey with 854 events and adriangb with 825 events, which shows that activity is concentrated among a few key maintainers.

DataFusion's ecosystem extends beyond the core Rust implementation through multiple language bindings and specialized projects. DataFusion Python provides a Python interface for SQL and DataFrame queries, DataFusion Java offers Java bindings, and DataFusion Comet serves as an accelerator for Apache Spark based on the DataFusion engine. The project is classified across multiple analytical domains including analytics platforms, data integration, data processing, scalable queries, distributed computing, streaming data, ETL workflows, batch processing, query optimization, and real-time analytics. GitGenius identifies overlapping contributors with trinodb/trino, lance-format/lance, and apache/datafusion-ballista, indicating cross-pollination within the broader data processing ecosystem.

The crate provides extensive customization through configurable features. Default features include nested expressions for working with complex types, compression support for multiple formats, cryptographic and datetime functions, encoding functions, Parquet and SQL support, regular expression functions, Unicode-aware operations, and logical plan unparsing. Optional features add Apache Avro support, backtrace information in error messages, Parquet Modular Encryption, and serialization capabilities. The project follows Apache Software Foundation licensing under the Apache License 2.0 and maintains a committed Cargo.lock file with regular dependency updates via Dependabot, ensuring reproducible builds and managed dependency evolution.
