Description: High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale
View eventual-inc/daft on GitHub ↗
Daft is an open-source data engine developed by Eventual Inc., designed to simplify and accelerate the development of production-grade data and ML pipelines, particularly those working with multimodal data such as images, audio, video, and embeddings. It aims to combine the expressiveness of a Pandas-like DataFrame API with the scalability of a distributed engine, allowing data scientists and engineers to write transformations in familiar Python that execute on an optimized Rust core, either locally or across a cluster. Essentially, Daft provides a high-level DataFrame API over a purpose-built query engine, making large-scale data processing accessible and efficient for a wider range of users.
At its core, Daft introduces `daft.DataFrame`, which resembles a Pandas DataFrame but operates lazily on potentially distributed data. This DataFrame isn't a wrapper around Pandas or Spark; instead, it's a new data structure built on Apache Arrow, enabling zero-copy data sharing and efficient columnar execution. Daft's query optimizer translates Pandas-like operations (filtering, grouping, aggregations, joins, etc.) into an optimized plan that only runs when results are actually requested. A key benefit is schema inference and type checking at plan-construction time, catching many errors before a job executes rather than midway through it.
The library focuses heavily on performance. Daft's optimizer performs predicate pushdown (filtering data as early as possible, including at the storage layer) and column pruning (reading only the columns a query needs), and its execution kernels are implemented in Rust. It also supports user-defined functions (UDFs) written in Python, which operate on whole batches of column data at a time, amortizing Python call overhead across many rows. Furthermore, Daft leverages Apache Arrow's columnar memory model to minimize data serialization and deserialization overhead, a significant bottleneck in many distributed applications. The goal is to achieve performance comparable to, or even exceeding, hand-tuned pipelines, but with significantly less development effort.
Daft's architecture is modular and extensible. A query optimizer turns DataFrame operations into logical and physical plans, and pluggable runners execute those plans: a native multithreaded runner for a single machine, and a Ray runner for distributed clusters. The library also interoperates with popular data science tools such as Pandas, Polars, and PyArrow through Arrow-based conversions, enabling users to move between local and distributed data processing. It supports reading and writing data in formats including Parquet, CSV, and JSON, integrates with table formats such as Delta Lake and Apache Iceberg, and works over both local filesystems and cloud object storage.
As of late 2023/early 2024, Daft is in active development and has not yet reached a 1.0 release. However, it demonstrates significant promise for simplifying large-scale data processing and improving the performance of data pipelines. The project is actively maintained by Eventual Inc. and has a growing community of contributors. Its focus on usability, performance, and integration with existing data science workflows positions it as a potentially valuable tool for organizations looking to scale their data and AI workloads while reducing the complexity traditionally associated with distributed systems.