hadoop by apache

Description: Apache Hadoop


Summary Information

Updated 2 hours ago

Added to GitGenius on May 13th, 2026

Created on August 28th, 2014

Open Issues & Pull Requests: 161 (+0)

Number of forks: 9,235

Total Stargazers: 15,596 (+0)

Total Subscribers: 957 (+0)

Issue Activity (beta)

Open issues: 0

New in 7 days: 0

Closed in 7 days: 0

Avg open age: N/A

Stale 30+ days: 0

Stale 90+ days: 0

Recent activity

Opened in 7 days: 0

Closed in 7 days: 0

Comments in 7 days: 0

Events in 7 days: 0

Top labels

No label distribution available yet.

Most active issues this week

No issue events were indexed in the last 7 days.


Detailed Description

Apache Hadoop is a distributed computing framework written in Java that enables the processing and storage of large datasets across clusters of commodity hardware. The project provides a foundation for big data processing through its core components: HDFS (Hadoop Distributed File System) for distributed storage, MapReduce for parallel data processing, and YARN (Yet Another Resource Negotiator) for cluster management and resource scheduling. These components work together to deliver fault tolerance and data scalability, allowing organizations to build reliable systems that can handle massive volumes of data.

The repository is the official Apache implementation of Hadoop, maintained as an open-source project under the Apache Software Foundation. As the reference implementation, it underpins many data analytics platforms and the wider big data ecosystem. Because the codebase is primarily Java, it integrates readily with the broader Java ecosystem, which has helped adoption in enterprise environments.

HDFS addresses the challenge of storing very large files reliably across distributed systems by replicating data blocks across multiple nodes, ensuring that data remains accessible even when individual machines fail. MapReduce provides a programming model for processing large datasets in parallel by dividing work into map and reduce phases, allowing computations to be distributed across cluster nodes. YARN abstracts the cluster's computational resources and enables multiple data processing engines to share the same hardware infrastructure, moving beyond MapReduce's original resource management limitations.
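The map and reduce phases described above can be illustrated with a small in-memory word count. This is a minimal sketch in plain Java, not Hadoop's MapReduce API: the class name `WordCountSketch` and its `map`, `shuffle`, `reduce`, and `run` methods are hypothetical names chosen for illustration, and everything runs in a single process rather than across cluster nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-process imitation of the map -> shuffle -> reduce flow.
// None of these names come from the Hadoop codebase.
class WordCountSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group all emitted values by key so each key reaches one reducer call.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values collected for one key into a single count.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Drive the three phases over a list of input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line)); // on a cluster, each line (split) could be mapped on a different node
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(emitted).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the quick brown fox", "the lazy dog", "The fox")));
    }
}
```

Because `map` looks at one line at a time and `reduce` looks at one key at a time, both steps can in principle run independently on different machines, which is the property the programming model relies on.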

The project's architecture emphasizes fault tolerance as a core design principle, recognizing that failures are inevitable in large-scale distributed systems. By automatically replicating data and redistributing work when nodes fail, Hadoop enables organizations to use inexpensive commodity hardware without sacrificing reliability. This approach fundamentally changed the economics of large-scale data processing by making it feasible to build systems from standard components rather than specialized hardware.
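The idea of replicating blocks and restoring copies after a node failure can be sketched with a toy model. The code below is an assumption-laden illustration, not HDFS's actual placement or recovery logic (which, among other things, is rack-aware): the class `ReplicationSketch`, its methods, and the "least-loaded node" rule are all hypothetical. The replication factor of 3 used in `main` matches the HDFS default.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model: tracks which nodes hold a replica of each block.
// It only models bookkeeping; no bytes are actually copied.
class ReplicationSketch {
    private final int replicationFactor;
    private final Set<String> liveNodes = new TreeSet<>();
    private final Map<String, Set<String>> blockToNodes = new TreeMap<>();

    ReplicationSketch(int replicationFactor, Collection<String> nodes) {
        this.replicationFactor = replicationFactor;
        this.liveNodes.addAll(nodes);
    }

    // Register a new block and place its replicas.
    void addBlock(String blockId) {
        blockToNodes.put(blockId, new TreeSet<>());
        restore(blockId);
    }

    // Mark a node as failed and re-replicate every block that lost a copy.
    void failNode(String node) {
        liveNodes.remove(node);
        for (Map.Entry<String, Set<String>> entry : blockToNodes.entrySet()) {
            if (entry.getValue().remove(node)) {
                restore(entry.getKey());
            }
        }
    }

    Set<String> replicasOf(String blockId) {
        return Collections.unmodifiableSet(blockToNodes.get(blockId));
    }

    // Add replicas on the least-loaded live nodes until the target count is reached.
    private void restore(String blockId) {
        Set<String> holders = blockToNodes.get(blockId);
        while (holders.size() < replicationFactor) {
            String target = null;
            for (String node : liveNodes) {
                if (holders.contains(node)) {
                    continue; // never place two replicas of a block on one node
                }
                if (target == null || load(node) < load(target)) {
                    target = node;
                }
            }
            if (target == null) {
                break; // not enough live nodes; the block stays under-replicated
            }
            holders.add(target);
        }
    }

    // Number of block replicas currently stored on a node.
    private int load(String node) {
        int count = 0;
        for (Set<String> holders : blockToNodes.values()) {
            if (holders.contains(node)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        ReplicationSketch cluster = new ReplicationSketch(3, List.of("n1", "n2", "n3", "n4"));
        cluster.addBlock("b1");
        cluster.addBlock("b2");
        System.out.println("before failure: " + cluster.replicasOf("b1"));
        cluster.failNode("n1");
        System.out.println("after failure:  " + cluster.replicasOf("b1"));
    }
}
```

With four nodes and three replicas per block, losing any single node still leaves two copies of every block, and the model can bring each block back to three copies using the remaining nodes.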

The issue metrics indexed by GitGenius show no GitHub issue events in the last 7 days and no open issues, so the 161 open items counted in the summary are pull requests. This is consistent with how the project works: Hadoop tracks issues in Apache JIRA rather than in GitHub Issues, so the GitHub issue counters say little about development activity. The open pull requests span distributed storage, data processing, cluster management, and resource scheduling, which suggests continued refinement of the core components by multiple contributors.

The repository's history reflects Hadoop's evolution from an initial implementation of the MapReduce and HDFS concepts into a general-purpose platform for distributed data processing. Over time, the project has addressed limitations in the original design, most notably by introducing YARN, which separated resource management from the MapReduce execution engine and enabled the platform to support diverse processing frameworks.

The official project website at hadoop.apache.org and the community wiki at cwiki.apache.org/confluence/display/HADOOP/ serve as primary resources for documentation, releases, and community information. These resources complement the repository itself by providing deployment guides, configuration documentation, and community discussions that support users implementing Hadoop in production environments. The project continues to be actively maintained and remains central to many organizations' big data infrastructure strategies.
