hive
by
apache

Description: Apache Hive


Summary Information

Updated 58 minutes ago
Added to GitGenius on January 3rd, 2025
Created on May 21st, 2009
Open Issues/Pull Requests: 55 (+0)
Number of forks: 4,790
Total Stargazers: 6,013 (+0)
Total Subscribers: 310 (+0)

Detailed Description

Apache Hive is open-source data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. It provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. Originally developed at Facebook, Hive became a top-level Apache Software Foundation project in 2010 and is distributed under the Apache License 2.0.

Hive is designed for both batch processing and interactive queries on data held in Hadoop-compatible distributed storage such as HDFS. Users define tables over large datasets stored in formats such as text files, ORC, Parquet, and Avro. Queries against these tables are written in HiveQL, which Hive compiles into jobs for an execution engine such as MapReduce, Tez, or Spark.
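As a rough illustration of defining a table over existing files and then querying it, the sketch below builds two HiveQL statements as Python strings. The table name `page_views`, its columns, the partition column `dt`, and the path `/data/page_views` are all hypothetical. The helper only assembles text; submitting the statements to a Hive client is not shown.

```python
def create_external_table_ddl(table, columns, partition_cols, file_format, location):
    """Build a HiveQL CREATE EXTERNAL TABLE statement as a string.

    Every argument is illustrative; nothing here contacts a Hive server.
    """
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    parts = ", ".join(f"{name} {hive_type}" for name, hive_type in partition_cols)
    ddl = f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols})"
    if parts:
        ddl += f" PARTITIONED BY ({parts})"
    ddl += f" STORED AS {file_format} LOCATION '{location}'"
    return ddl


# Hypothetical table over ORC files that already exist in distributed storage.
ddl = create_external_table_ddl(
    table="page_views",
    columns=[("user_id", "BIGINT"), ("url", "STRING")],
    partition_cols=[("dt", "STRING")],
    file_format="ORC",
    location="/data/page_views",
)

# A HiveQL query over that table; filtering on the partition column
# lets the engine skip partitions that do not match.
query = (
    "SELECT url, COUNT(*) AS views "
    "FROM page_views "
    "WHERE dt = '2025-01-03' "
    "GROUP BY url"
)

print(ddl)
print(query)
```

Because the table is declared `EXTERNAL`, dropping it removes only the metadata and leaves the underlying files in place, which is a common choice when other tools share the same data.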

A key feature of Apache Hive is its support for schema evolution. Users can change a table's schema over time, for example by adding columns, without reloading existing datasets from scratch. This makes it practical to keep analyzing and reporting on data whose structure changes.
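One common form of schema evolution is appending a column with `ALTER TABLE ... ADD COLUMNS`. The sketch below builds that statement and pairs it with a small pure-Python simulation of the schema-on-read idea: rows written before the change are projected onto the new schema, and the missing column reads as NULL (`None` here). The table, column names, and sample rows are hypothetical, and the simulation is not Hive's implementation; actual behavior depends on the file format and table settings.

```python
def add_columns_ddl(table, new_columns):
    """Build a HiveQL ALTER TABLE ... ADD COLUMNS statement as a string."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in new_columns)
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"


def read_with_schema(rows, schema):
    """Project stored rows onto the current schema.

    Fields absent from an older row come back as None, mimicking NULL.
    """
    return [{col: row.get(col) for col in schema} for row in rows]


# Rows written before and after the hypothetical `referrer` column was added.
old_rows = [{"user_id": 1, "url": "/home"}]
new_rows = [{"user_id": 2, "url": "/docs", "referrer": "search"}]

alter = add_columns_ddl("page_views", [("referrer", "STRING")])
schema = ["user_id", "url", "referrer"]
result = read_with_schema(old_rows + new_rows, schema)

print(alter)
for row in result:
    print(row)
```

The point of the simulation is that the old rows are never rewritten: the schema change is a metadata operation, and the new column is filled in at read time.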

The repository at https://github.com/apache/hive contains the source code, documentation, and build scripts for developing and maintaining the software. It also provides tools and libraries for integrating Hive with other components of the Hadoop ecosystem, such as YARN for resource management and HBase, whose tables Hive can query through a storage handler.

Hive's architecture is modular, which makes it extensible and customizable for various use cases. The project includes sub-modules such as WebHCat (a REST API for HCatalog), Beeline (a JDBC-based command-line shell that connects to HiveServer2), and the metastore, which manages metadata about tables and partitions.
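To make the Beeline entry more concrete, the sketch below assembles a Beeline invocation as an argument list. It uses Beeline's `-u` (JDBC URL), `-n` (user name), and `-e` (query) options and the `jdbc:hive2://` URL scheme. The host, port, database, user, and query are assumptions chosen for illustration, and the command is only printed, not executed.

```python
import shlex


def beeline_command(host, port, database, user, query):
    """Return an argv list for running one query through Beeline.

    The connection details are placeholders; adjust them for a real
    HiveServer2 instance.
    """
    jdbc_url = f"jdbc:hive2://{host}:{port}/{database}"
    return ["beeline", "-u", jdbc_url, "-n", user, "-e", query]


# Hypothetical local HiveServer2 endpoint and user.
cmd = beeline_command("localhost", 10000, "default", "analyst", "SHOW TABLES")

# Print a shell-quoted form of the command for inspection.
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell-quoting mistakes if it is later passed to something like `subprocess.run`.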

The community around Apache Hive is active, with frequent contributions from individuals and organizations that rely on it for data processing at scale. The repository reflects this activity through issue tracking, pull requests, and continuous integration workflows that help catch regressions before changes are merged.

Overall, Apache Hive is a widely used tool in the big data ecosystem. It offers an accessible, SQL-based way to manage and analyze large volumes of data while remaining flexible enough to adapt to different processing requirements, and its open-source development model draws on a diverse community of contributors.
