spark by apache

Description: Apache Spark - A unified analytics engine for large-scale data processing


Summary Information

Updated 55 seconds ago

Added to GitGenius on June 12th, 2023

Created on February 25th, 2014

Open Issues & Pull Requests: 440 (+0)

Number of forks: 29,269

Total Stargazers: 43,596 (+0)

Total Subscribers: 1,987 (+0)

Issue Activity (beta)

Open issues: 51

New in 7 days: 1

Closed in 7 days: 1

Avg open age: 17 days

Stale 30+ days: 37

Stale 90+ days: 7

Recent activity

Opened in 7 days: 0

Closed in 7 days: 1

Comments in 7 days: 0

Events in 7 days: 2

Top labels

Stale (11)

Most active issues this week

#54999 AES-GCM for RPC encryption does not work on YARN - 2 events / 0 comments


Repository Insights (GitGenius)

Median issue/PR response: 12.2 hours

Mean response time: 4.1 days

90th percentile: 10.5 days

Tracked items: 70

Most active contributors

gaogaotiantian - 21 events, 16 issues
pan3793 - 19 events, 10 issues
HyukjinKwon - 11 events, 5 issues
joshuaharris951-cmd - 7 events, 4 issues
IMarvinTPA - 6 events, 3 issues

Related by overlapping contributors

Detailed Description

Apache Spark is a unified analytics engine for large-scale data processing maintained by the Apache Software Foundation. Written primarily in Scala, it provides high-level APIs across multiple programming languages including Scala, Java, Python, and R, enabling developers to write distributed data processing applications in their language of choice. The engine supports general computation graphs for data analysis and is optimized for both batch and streaming workloads.

The repository contains several libraries built on the core Spark engine. Spark SQL lets users query structured data with SQL and work with DataFrames, an interface familiar to data analysts. The pandas API on Spark runs pandas-style operations on distributed datasets, bridging single-machine pandas workflows and large-scale distributed processing. MLlib provides machine learning algorithms for classification, regression, clustering, and collaborative filtering. GraphX supports graph processing and analysis. Structured Streaming handles stream processing with fault tolerance and can provide end-to-end exactly-once guarantees when sources are replayable and sinks are idempotent.

The codebase is classified under several technical domains that reflect its breadth: distributed computing, real-time analysis, SQL queries, streaming analytics, and machine learning. It maintains APIs for Scala, Java, and Python (and, as noted above, R), which makes it accessible to a wide range of developers. Resilient Distributed Datasets (RDDs) are the core abstraction underneath the system, though the higher-level DataFrame and Dataset APIs are better optimized and easier to use for most workloads.
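To make the RDD idea more concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API or implementation: `ToyRDD` and its methods are hypothetical names invented for this illustration, and the data lives in one local list rather than in partitions spread across a cluster. It only shows the general pattern: transformations record a recipe lazily, and an action replays that recipe from the source, which is what allows lost results to be recomputed.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: it records how to compute data, not the data.

    Hypothetical class for illustration only; this is not part of Spark.
    """

    def __init__(
        self,
        source: Optional[List[int]] = None,
        parent: Optional["ToyRDD"] = None,
        op: Optional[Callable[[Iterable[int]], List[int]]] = None,
    ) -> None:
        self.source = source  # set only on the root of the lineage
        self.parent = parent  # previous step in the lineage
        self.op = op          # how to derive this step from the parent

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # Lazy transformation: extends the lineage, computes nothing yet.
        return ToyRDD(parent=self, op=lambda items: [fn(x) for x in items])

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        # Lazy transformation: extends the lineage, computes nothing yet.
        return ToyRDD(parent=self, op=lambda items: [x for x in items if pred(x)])

    def collect(self) -> List[int]:
        # Action: replay the whole lineage starting from the source data.
        if self.parent is None:
            return list(self.source)
        return self.op(self.parent.collect())


numbers = ToyRDD(source=[1, 2, 3, 4, 5])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = even_squares.collect()
print(result)
```

Because every `collect()` call starts again from the source, the result can be rebuilt at any time from the recorded steps. The DataFrame and Dataset APIs mentioned above sit on top of this kind of model and add query optimization.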

GitGenius reported 422 open issues at the time this description was generated; the summary above lists 440 open issues and pull requests, so the two figures appear to come from different snapshots or counting rules. Response latency for issues and pull requests has a mean of 98.9 hours (about 4.1 days) and a median of 12.2 hours. The gap suggests that most items get a response within about half a day, while a small number of slow items pull the mean up. The most active contributors tracked are gaogaotiantian with 21 events, pan3793 with 19 events, and HyukjinKwon with 11 events. Stale is the top label among tracked issues (11 occurrences), which suggests that long-standing issues are flagged by an automated process.
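The difference between the median and mean response times is typical of right-skewed data. The sketch below shows the effect using only the Python standard library. The sample values are invented for illustration and are not taken from the Spark repository or from GitGenius. The nearest-rank percentile method is also an assumption, since the page does not say how its 90th percentile is computed.

```python
import statistics
from typing import List

# Hypothetical response times in hours; invented for illustration,
# not taken from the Spark repository or from GitGenius.
response_hours: List[float] = [1, 2, 4, 8, 12, 13, 20, 48, 240, 640]


def nearest_rank_percentile(values: List[float], pct: int) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 gives the 1-based rank.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


median_h = statistics.median(response_hours)
mean_h = statistics.mean(response_hours)
p90_h = nearest_rank_percentile(response_hours, 90)

print(f"median: {median_h:.1f} h")
print(f"mean:   {mean_h:.1f} h ({mean_h / 24:.1f} days)")
print(f"p90:    {p90_h:.1f} h ({p90_h / 24:.1f} days)")
```

In this made-up sample, two slow items are enough to push the mean far above the median, which is why the median is usually the better indicator of the typical wait.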

The repository shows interconnected development with other major data processing projects. GitGenius identifies overlapping contributors with pola-rs/polars, delta-io/delta, and python/cpython, indicating that developers working on Spark often contribute to complementary data processing and Python ecosystem projects. This cross-pollination reflects Spark's central role in the broader data engineering and analytics landscape.

Documentation is available on the official Spark website, including a build for the development version, and serves both beginners and advanced practitioners. Spark's distinguishing feature is that batch processing, streaming, SQL, machine learning, and graph processing all run on a single unified engine.
