beam
by
apache

Description: Apache Beam is a unified programming model for Batch and Streaming data processing.

View apache/beam on GitHub ↗

Summary Information

Updated 1 hour ago
Added to GitGenius on January 3rd, 2025
Created on February 2nd, 2016
Open Issues/Pull Requests: 4,147 (-1)
Number of forks: 4,516
Total Stargazers: 8,496 (+0)
Total Subscribers: 258 (+0)
Detailed Description

Apache Beam is an open-source, unified programming model for defining and executing data processing pipelines. It is designed to abstract away the complexities of the underlying execution engine, letting developers focus on the logic of their data transformations, whether the pipeline runs on Apache Spark, Apache Flink, Google Cloud Dataflow, or another backend. This portability is Beam's core strength: teams can move pipelines between environments without significant code changes.

At its heart, Beam uses a dataflow model that treats bounded (batch) and unbounded (streaming) data uniformly. A pipeline is structured as a Directed Acyclic Graph (DAG) in which nodes represent transformations and edges represent the flow of data between them. Beam provides SDKs in Java, Python, and Go. The same model covers both straightforward, stateless operations such as filtering, mapping, and joining datasets, and more complex stateful processing involving windowing, aggregation, and per-key state, with the heavy lifting delegated to the underlying execution engine.
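As a rough illustration of the dataflow model, the following is a toy sketch in plain Python (not the Beam SDK itself): a pipeline is a linear chain of composable transformations applied to a collection, mirroring a simple DAG.

```python
# Toy model of a dataflow pipeline: a chain of composable
# transformations over a collection. Illustrative only -- this is
# plain Python, NOT the Apache Beam SDK.

def map_transform(fn):
    """Element-wise mapping step."""
    return lambda data: [fn(x) for x in data]

def filter_transform(pred):
    """Keep only elements satisfying the predicate."""
    return lambda data: [x for x in data if pred(x)]

def pipeline(*transforms):
    """Compose transforms left to right, like a linear DAG."""
    def run(data):
        for t in transforms:
            data = t(data)
        return data
    return run

# A simple batch "pipeline": keep even numbers, then square them.
run = pipeline(
    filter_transform(lambda x: x % 2 == 0),
    map_transform(lambda x: x * x),
)
print(run([1, 2, 3, 4, 5]))  # [4, 16]
```

In Beam proper the same shape appears as PTransforms chained with the `|` operator, and the graph is handed to a runner rather than executed inline.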

Beam’s core concepts include *PCollections*, which represent distributed datasets, and *Transforms*, which are operations that process these datasets. Transforms are composable: they can be chained together to build complex data processing workflows. Beam also defines *Runners*, which execute a pipeline on a specific execution engine, and *Triggers*, which determine when the results of windowed aggregations are emitted. The SDKs include testing utilities for verifying the correctness of pipelines, helping ensure data integrity throughout the processing chain. Beam’s design emphasizes correctness and efficiency, with fault tolerance and resource management handled by the underlying runner.
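The Runner idea can be sketched in miniature: the same pipeline description is executed by interchangeable "runners". Again a hedged toy model in plain Python, not the Beam SDK; the class and function names here are invented for illustration.

```python
# Toy illustration of the Runner concept: one pipeline description,
# multiple execution strategies. Plain Python, NOT the Beam SDK.

def keep_positive(xs):
    """A composable transform: drop non-positive elements."""
    return (x for x in xs if x > 0)

def double(xs):
    """A composable transform: multiply each element by two."""
    return (x * 2 for x in xs)

class EagerRunner:
    """Materializes every intermediate result as a list (batch-style)."""
    def run(self, data, transforms):
        for t in transforms:
            data = list(t(data))
        return data

class LazyRunner:
    """Streams elements through the whole chain (streaming-style)."""
    def run(self, data, transforms):
        for t in transforms:
            data = t(data)
        return list(data)

transforms = [keep_positive, double]
print(EagerRunner().run([-1, 2, 3], transforms))  # [4, 6]
print(LazyRunner().run([-1, 2, 3], transforms))   # [4, 6]
```

Both runners produce the same result; they differ only in execution strategy, which is the point of separating the pipeline graph from its execution.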

Beam’s evolution has been driven by the need for a flexible, portable data processing framework. It originated as the Google Cloud Dataflow SDK and was donated to the Apache Software Foundation in 2016, where it matured into an independent top-level project. The community actively contributes to the project, improving the APIs, adding new features, and supporting additional execution engines. The project is influenced by the principles of functional programming, promoting immutability and side-effect-free transformations. Beam is increasingly popular for building data pipelines in industries like advertising, finance, and IoT, where the ability to process large volumes of data quickly and reliably is critical. Its success is largely due to its focus on developer productivity and its commitment to a truly portable and scalable data processing model. Ultimately, Beam lets developers build robust, adaptable pipelines that can meet the evolving demands of modern data-driven applications.
