Description: High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale
View eventual-inc/daft on GitHub ↗
Daft is an open-source data engine developed by Eventual Inc., designed to simplify and accelerate the development of production-grade data and ML pipelines, particularly those working with multimodal data such as images, audio, video, and embeddings. It aims to combine the expressiveness of a Pandas-like DataFrame API with the scalability of a distributed engine, allowing data scientists and engineers to write transformations in familiar Python that execute on an optimized Rust core, either locally or across a cluster. Essentially, Daft provides a high-level DataFrame API over a purpose-built query engine, making large-scale data processing accessible and efficient for a wider range of users.
At its core, Daft introduces `daft.DataFrame`, which resembles a Pandas DataFrame but operates lazily on potentially distributed data. This DataFrame isn't a wrapper around Pandas or Spark; instead, it's a new data structure built on Apache Arrow, enabling zero-copy data sharing and efficient columnar execution. Daft's query optimizer translates Pandas-like operations (filtering, grouping, aggregations, joins, etc.) into an optimized plan that only runs when results are actually requested. A key benefit is schema inference and type checking at plan-construction time, catching many errors before a job executes rather than midway through it.
The library focuses heavily on performance. Daft's optimizer performs predicate pushdown (filtering data as early as possible, including at the storage layer) and column pruning (reading only the columns a query needs), and its execution kernels are implemented in Rust. It also supports user-defined functions (UDFs) written in Python, which operate on whole batches of column data at a time, amortizing Python call overhead across many rows. Furthermore, Daft leverages Apache Arrow's columnar memory model to minimize data serialization and deserialization overhead, a significant bottleneck in many distributed applications. The goal is to achieve performance comparable to, or even exceeding, hand-tuned pipelines, but with significantly less development effort.
Daft's architecture is modular and extensible. A query optimizer turns DataFrame operations into logical and physical plans, and pluggable runners execute those plans: a native multithreaded runner for a single machine, and a Ray runner for distributed clusters. The library also interoperates with popular data science tools such as Pandas, Polars, and PyArrow through Arrow-based conversions, enabling users to move between local and distributed data processing. It supports reading and writing data in formats including Parquet, CSV, and JSON, integrates with table formats such as Delta Lake and Apache Iceberg, and works over both local filesystems and cloud object storage.
As of late 2023/early 2024, Daft is in active development and has not yet reached a 1.0 release. However, it demonstrates significant promise for simplifying large-scale data processing and improving the performance of data pipelines. The project is actively maintained by Eventual Inc. and has a growing community of contributors. Its focus on usability, performance, and integration with existing data science workflows positions it as a potentially valuable tool for organizations looking to scale their data and AI workloads while reducing the complexity traditionally associated with distributed systems.