orc
by
apache

Description: Apache ORC - the smallest, fastest columnar storage for Hadoop workloads

View apache/orc on GitHub ↗

Summary Information

Updated 36 minutes ago
Added to GitGenius on January 4th, 2025
Created on May 6th, 2015
Open Issues/Pull Requests: 16 (-1)
Number of forks: 508
Total Stargazers: 765 (+0)
Total Subscribers: 43 (+0)
Detailed Description

Apache ORC (Optimized Row Columnar) is a columnar storage format developed under the Apache Software Foundation, designed for high-performance data warehousing and analytics. It is fundamentally built to optimize storage and retrieval for analytical workloads, particularly those dealing with large datasets. Unlike row-oriented formats such as CSV or Avro, ORC stores data column-wise, which dramatically improves performance when queries only need to access a subset of columns – a common scenario in data warehousing. This column-oriented structure allows for efficient compression and reduces I/O operations, leading to faster query execution times.

At its core, ORC’s design focuses on minimizing I/O, which it achieves through several key features. First, it supports a rich type system, including complex types such as structs, lists, maps, and unions, and applies lightweight, type-aware encodings (for example, run-length encoding for integers and dictionary encoding for strings) that reduce storage overhead. Second, every file is self-describing: a footer records the schema, the encodings used for each column, and column-level statistics, allowing a reader to quickly determine the optimal way to scan the data without consulting any external metadata. Third, ORC supports several general-purpose compression codecs, including ZLIB, Snappy, LZ4, and Zstandard, further reducing storage space and improving I/O performance. The codec can be configured per table or per file, for example via Hive’s orc.compress table property.

ORC is well supported by popular data processing engines such as Apache Spark, Apache Hive, Presto, and Impala. These engines ship built-in readers, writers, and optimizers specifically tailored for ORC (vectorized reads, predicate pushdown), allowing them to leverage its performance benefits. Note that ORC is a distinct on-disk format and is not binary-compatible with Parquet; migrating existing Parquet data to ORC is nevertheless straightforward, because any of these engines can read Parquet and rewrite the same tables as ORC without significant code changes. This ease of migration is a significant advantage.

Furthermore, ORC files are divided into large, independently readable units called stripes (64 MB by default), each containing index data, row data, and a stripe footer. Lightweight indexes record minimum/maximum statistics for every column at the file, stripe, and row-group level (every 10,000 rows by default), and optional Bloom filters can be written per column. Query engines use these statistics for data skipping: entire stripes or row groups that cannot possibly match a filter predicate are never read. The project is actively maintained and continuously evolving, incorporating new features and optimizations based on user feedback and evolving analytical workloads.

Ultimately, Apache ORC represents a significant advancement in data storage for analytical applications. Its columnar format, combined with a rich type system, lightweight indexes, configurable compression, and first-class support in major data processing engines, makes it a powerful tool for optimizing performance and reducing storage costs in data warehousing environments. The ongoing development and community support ensure its continued relevance and effectiveness in the face of increasingly complex analytical demands.

