parquet-format by apache

Description: Apache Parquet Format


Summary Information

Updated 22 minutes ago

Added to GitGenius on January 4th, 2025

Created on June 10th, 2014

Open Issues & Pull Requests: 87 (+0)

Number of forks: 494

Total Stargazers: 2,483 (+0)

Total Subscribers: 67 (+0)

Issue Activity (beta)

Open issues: 58

New in 7 days: 0

Closed in 7 days: 0

Avg open age: 1,389 days

Stale 30+ days: 56

Stale 90+ days: 51

Recent activity

Opened in 7 days: 0

Closed in 7 days: 0

Comments in 7 days: 0

Events in 7 days: 0

Top labels

Priority: Major (121)
Type: enhancement (112)
Type: bug (63)
Priority: Minor (33)
Type: task (30)
Priority: Trivial (9)
Priority: Critical (5)
Priority: Blocker (2)

Most active issues this week

No issue events were indexed in the last 7 days.


Repository Insights (GitGenius)

Median issue/PR response: 45.6 days

Mean response time: 932.2 days

90th percentile: 2980.0 days

Tracked items: 106

Most active contributors

wgtmac - 111 events, 56 issues
alamb - 32 events, 16 issues
emkornfield - 21 events, 15 issues
asfimport - 17 events, 6 issues
pitrou - 16 events, 4 issues

Related by overlapping contributors

Detailed Description

The apache/parquet-format repository contains the formal specification and Thrift definitions for Apache Parquet, an open-source columnar data file format designed for efficient storage and retrieval in the Hadoop ecosystem. The repository is written primarily in Thrift and serves as the authoritative reference for how Parquet files should be structured, encoded, and interpreted across different programming languages and analytics tools.

Apache Parquet was created to provide compressed, efficient columnar data representation to any project in the Hadoop ecosystem. The format is built from the ground up to handle complex nested data structures using the record shredding and assembly algorithm described in the Dremel paper, which the project considers superior to simple flattening of nested namespaces. The format supports very efficient compression and encoding schemes that can be specified on a per-column level, allowing flexibility as new compression techniques are invented and implemented.

The repository defines the hierarchical structure of Parquet files, which consist of one or more row groups, with each row group containing exactly one column chunk per column, and each column chunk containing one or more pages. Files begin with a four-byte magic number "PAR1", followed by column chunks organized by row group, with file metadata written at the end to enable single-pass writing. The format supports eight primitive types: BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY, and FIXED_LEN_BYTE_ARRAY. Logical types extend these primitives to support additional data representations like strings, while maintaining minimal complexity in reader and writer implementations.
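The file framing described above can be checked with a few lines of Python. This is a minimal sketch, not a reader: it relies on two details from the Parquet specification that the paragraph does not spell out, namely that the file also ends with "PAR1" and that the four bytes before that trailing magic hold the little-endian length of the serialized file metadata. The function name `locate_footer` and the synthetic buffer are hypothetical, and files with an encrypted footer (which use a different magic) are not handled.

```python
import struct

MAGIC = b"PAR1"  # four-byte magic number at both ends of the file


def locate_footer(data: bytes) -> tuple[int, int]:
    """Return (offset, length) of the serialized file metadata.

    Assumed layout: MAGIC | column chunks | metadata | 4-byte LE length | MAGIC
    """
    if len(data) < 12:
        raise ValueError("too short to be a Parquet file")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic")
    # The metadata length sits just before the trailing magic.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return footer_start, footer_len


# Synthetic buffer standing in for a real file (placeholder bytes only).
fake_chunks = b"\x00" * 10
fake_metadata = b"\x01\x02\x03"
blob = MAGIC + fake_chunks + fake_metadata + struct.pack("<I", len(fake_metadata)) + MAGIC

offset, length = locate_footer(blob)
print(offset, length)
```

Because the metadata and its length come last, a writer can stream column chunks out in one pass and record their offsets only once everything has been written; a reader starts from the end of the file instead.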

The specification covers critical technical details including nested encoding using definition and repetition levels based on the Dremel encoding scheme, null value handling through definition levels with run-length encoding, and data page structure with encoded values, definition levels, and repetition levels. The repository documents supported encodings and compression codecs in separate specification files.
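To make definition and repetition levels concrete, here is a minimal sketch for the simplest nested case: a schema with one top-level repeated field, so both levels have a maximum of 1. The function names `shred` and `assemble` are hypothetical, and the code illustrates the idea only; it is not the general algorithm for arbitrary nesting, and it leaves out the run-length encoding that is applied to the levels on disk.

```python
def shred(records):
    """Flatten lists of values into a value column plus two level columns."""
    values, def_levels, rep_levels = [], [], []
    for rec in records:
        if not rec:
            # Empty list: no value is stored, definition level 0 marks the gap.
            def_levels.append(0)
            rep_levels.append(0)
            continue
        for i, v in enumerate(rec):
            values.append(v)
            def_levels.append(1)                   # the field is present
            rep_levels.append(0 if i == 0 else 1)  # 0 starts a new record
    return values, def_levels, rep_levels


def assemble(values, def_levels, rep_levels):
    """Rebuild the original records from the three columns."""
    records = []
    it = iter(values)
    for d, r in zip(def_levels, rep_levels):
        if r == 0:
            records.append([])        # repetition level 0: new record
        if d == 1:
            records[-1].append(next(it))
    return records


records = [[1, 2, 3], [], [4]]
values, def_levels, rep_levels = shred(records)
print(values, def_levels, rep_levels)
print(assemble(values, def_levels, rep_levels))
```

Only the values that actually exist are stored; the two level columns carry all of the structure, which is why an empty list costs a pair of small integers and no value slot.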

Activity tracking shows a long-running project with a concentrated group of contributors, although no issue events were indexed in the last 7 days. The most active contributor tracked by GitGenius is wgtmac with 111 events, followed by alamb with 32 events and emkornfield with 21 events. Across 106 tracked issues and pull requests, the median response latency is 1,094.3 hours (about 45.6 days), while the mean of 22,373.1 hours (about 932.2 days) is pulled up by a long tail of slow responses. Among the top labels listed above, Priority: Major is the most common with 121 items, followed by Type: enhancement with 112 and Type: bug with 63. The repository shares contributors with apache/arrow, apache/datafusion, and python/cpython, which points to close ties with the broader data processing ecosystem.

The parquet-format project specifically focuses on format specifications and metadata definitions, while related projects like parquet-java provide implementation details and parquet-testing offers cross-language verification files. The repository can be built using Maven for Java resources and Make for C++ Thrift resources, with code generation available for any Thrift-supported language. This design ensures that Parquet implementations across different languages can reliably read and write each other's files while maintaining compatibility with the formal specification.
