Description: A composable and fully extensible C++ execution engine library for data management systems.
Velox is a powerful, open-source C++ library designed to serve as a composable execution engine for data management systems. Developed initially by Meta and now supported by a collaborative community including IBM, Intel, and Microsoft, Velox provides a flexible and high-performance foundation for building data processing systems across various analytical workloads, such as batch processing, interactive queries, stream processing, and AI/ML.
The core purpose of Velox is to provide developers with reusable and extensible components for building custom data processing engines. It is not intended for direct end-user interaction: it has no built-in SQL parser, dataframe layer, or query optimizer. Instead, developers integrate Velox into their own compute engines and assemble the building blocks they need.
The library is organized around several key components. The **Type** component is a generic type system that supports scalar, complex, and nested data types, so it can represent diverse data structures. The **Vector** module offers an Arrow-compatible columnar memory layout with encodings such as Flat, Dictionary, and Constant, along with lazy materialization and out-of-order write support. The **Expression Eval** component is a fully vectorized expression evaluation engine that executes expressions directly on Vector/Arrow-encoded data.
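To make the three encodings named above concrete, the following is a minimal conceptual sketch in plain C++. It does not use Velox's vector classes; `FlatColumn`, `DictionaryColumn`, `ConstantColumn`, and `flatten` are hypothetical names invented for illustration, and string values are assumed only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative stand-ins only; these are NOT Velox's vector classes.

// Flat: one stored value per row.
struct FlatColumn {
  std::vector<std::string> values;

  std::size_t size() const { return values.size(); }
  const std::string& valueAt(std::size_t row) const {
    assert(row < size());
    return values[row];
  }
};

// Dictionary: each row stores an index into a shared list of distinct values.
struct DictionaryColumn {
  std::vector<std::string> dictionary;
  std::vector<std::int32_t> indices;

  std::size_t size() const { return indices.size(); }
  const std::string& valueAt(std::size_t row) const {
    assert(row < size());
    return dictionary[static_cast<std::size_t>(indices[row])];
  }
};

// Constant: a single value that logically repeats for every row.
struct ConstantColumn {
  std::string value;
  std::size_t length;

  std::size_t size() const { return length; }
  const std::string& valueAt(std::size_t row) const {
    assert(row < size());
    return value;
  }
};

// Decodes any of the encodings above into a flat column, row by row.
template <typename Column>
FlatColumn flatten(const Column& column) {
  FlatColumn out;
  out.values.reserve(column.size());
  for (std::size_t row = 0; row < column.size(); ++row) {
    out.values.push_back(column.valueAt(row));
  }
  return out;
}
```

In this sketch, a dictionary column with dictionary `{"red", "green"}` and indices `{0, 1, 1, 0}` represents four rows while storing each distinct string once, and a constant column stores one value regardless of row count. All three expose the same `size()` and `valueAt()` interface, which is the property that lets downstream code read a column without caring how it is encoded.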
Furthermore, Velox includes a rich set of **Functions**, encompassing vectorized scalar, aggregate, and window functions, adhering to Presto and Spark semantics. **Operators** implement relational operations such as scans, writes, projections, filtering, grouping, ordering, and joins, providing the building blocks for complex query execution. The **I/O** component offers a connector interface for diverse data sources and sinks, supporting various file formats (ORC/DWRF, Parquet, Nimble) and storage adapters (S3, HDFS, GCS, ABFS, local files). **Network Serializers** enable efficient data transfer through different wire protocols, supporting PrestoPage and Spark's UnsafeRow. Finally, **Resource Management** provides primitives for handling computational resources, including memory arenas, buffer management, task and driver management, thread pools, and mechanisms for spilling and caching.
A significant advantage of Velox is its extensibility. Developers can customize the engine by defining their own specializations, including custom types, simple and vectorized functions, aggregate functions, window functions, operators, file formats, storage adapters, and network serializers. This modular design allows developers to tailor Velox to specific needs and optimize performance for particular workloads.
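The general idea of plugging a custom vectorized function into an engine can be sketched as a small name-to-function registry. This is only an illustration of the pattern under simplifying assumptions: `FunctionRegistry`, `VectorFunction`, `Column`, and `doubleValues` are hypothetical names, the column is a plain `std::vector` of 64-bit integers, and none of this mirrors Velox's actual registration API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; this does not mirror Velox's registration API.
using Column = std::vector<std::int64_t>;

// A vectorized function processes a whole column per call, not one row at a time.
using VectorFunction = std::function<Column(const Column&)>;

class FunctionRegistry {
 public:
  // Registers fn under name. Returns false if the name is already taken.
  bool registerFunction(const std::string& name, VectorFunction fn) {
    assert(fn && "function must be callable");
    return functions_.emplace(name, std::move(fn)).second;
  }

  // Looks up a function by name and applies it to the input column.
  Column call(const std::string& name, const Column& input) const {
    auto it = functions_.find(name);
    if (it == functions_.end()) {
      throw std::invalid_argument("unknown function: " + name);
    }
    return it->second(input);
  }

 private:
  std::map<std::string, VectorFunction> functions_;
};

// A user-defined function supplied by the embedding engine.
inline Column doubleValues(const Column& input) {
  Column out(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    out[i] = input[i] * 2;
  }
  return out;
}
```

With this sketch, an embedding engine would register `doubleValues` under a name such as `"double"` and later invoke it by name on a whole column. The engine core never needs to know about the function at compile time, which is the essence of the extensibility described above.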
The repository provides documentation, including developer guides and examples, to help with integration and customization. It also includes getting-started instructions that cover setting up dependencies, building the library on Linux and macOS, and building with Docker Compose, along with information on supported compilers and build metrics. The project is actively supported by a community, with communication channels including Slack, GitHub Issues, and Discussions, and it is licensed under the Apache 2.0 License.