orc
by
apache

Description: Apache ORC - the smallest, fastest columnar storage for Hadoop workloads

View apache/orc on GitHub ↗

Summary Information

Updated 36 minutes ago
Added to GitGenius on January 4th, 2025
Created on May 6th, 2015
Open Issues/Pull Requests: 16 (-1)
Number of forks: 508
Total Stargazers: 765 (+0)
Total Subscribers: 43 (+0)
Detailed Description

Apache ORC (Optimized Row Columnar) is a columnar storage format developed under the Apache Software Foundation, designed for high-performance data warehousing and analytics. It is fundamentally built to optimize storage and retrieval for analytical workloads, particularly those dealing with large datasets. Unlike row-oriented formats such as CSV or Avro, ORC stores data column-wise, which dramatically improves performance when queries only need to access a subset of columns – a common scenario in data warehousing. This column-oriented structure allows for efficient compression and reduces I/O operations, leading to faster query execution times.

At its core, ORC’s design focuses on minimizing I/O, which it achieves through several key features. First, it supports a rich type system, including complex types such as structs, lists, maps, and unions, and applies lightweight, type-aware encodings (for example, run-length encoding for integers and dictionary encoding for strings) that reduce storage overhead. Second, every file is self-describing: a footer records the schema, the encodings used for each column, and column-level statistics, allowing a reader to quickly determine the optimal way to scan the data without consulting any external metadata. Third, ORC supports several general-purpose compression codecs, including ZLIB, Snappy, LZ4, and Zstandard, further reducing storage space and improving I/O performance. The codec can be configured per table or per file, for example via Hive’s orc.compress table property.

ORC is well supported by popular data processing engines such as Apache Spark, Apache Hive, Presto, and Impala. These engines ship built-in readers, writers, and optimizers specifically tailored for ORC (vectorized reads, predicate pushdown), allowing them to leverage its performance benefits. Note that ORC is a distinct on-disk format and is not binary-compatible with Parquet; migrating existing Parquet data to ORC is nevertheless straightforward, because any of these engines can read Parquet and rewrite the same tables as ORC without significant code changes. This ease of migration is a significant advantage.

Furthermore, ORC files are divided into large, independently readable units called stripes (64 MB by default), each containing index data, row data, and a stripe footer. Lightweight indexes record minimum/maximum statistics for every column at the file, stripe, and row-group level (every 10,000 rows by default), and optional Bloom filters can be written per column. Query engines use these statistics for data skipping: entire stripes or row groups that cannot possibly match a filter predicate are never read. The project is actively maintained and continuously evolving, incorporating new features and optimizations based on user feedback and evolving analytical workloads.

Ultimately, Apache ORC represents a significant advancement in data storage for analytical applications. Its columnar format, combined with a rich type system, lightweight indexes, configurable compression, and first-class support in major data processing engines, makes it a powerful tool for optimizing performance and reducing storage costs in data warehousing environments. The ongoing development and community support ensure its continued relevance and effectiveness in the face of increasingly complex analytical demands.

