elasticsearch-hadoop by elastic

Description: :elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop


Summary Information

Updated 1 hour ago

Added to GitGenius on April 7th, 2021

Created on March 11th, 2013

Open Issues & Pull Requests: 139 (+0)

Number of forks: 998

Total Stargazers: 1,974 (+0)

Total Subscribers: 468 (+0)

Issue Activity (beta)

Open issues: 30

New in 7 days: 0

Closed in 7 days: 0

Avg open age: 719 days

Stale 30+ days: 29

Stale 90+ days: 25

Recent activity

Opened in 7 days: 0

Closed in 7 days: 0

Comments in 7 days: 0

Events in 7 days: 0

Top labels

:Serialization (3)
>docs (3)
bug (3)
doc (3)
:Core (2)
:Project-Meta (1)
:Spark (1)
enhancement (1)

Most active issues this week

No issue events were indexed in the last 7 days.


Repository Insights (GitGenius)

Median issue/PR response: 2.3 days

Mean response time: 51.5 days

90th percentile: 168.0 days

Tracked items: 25

Most active contributors

jbaiera - 30 events, 7 issues
masseyke - 30 events, 16 issues
KOTungseth - 3 events, 3 issues
bohdanpiddubny - 3 events, 1 issue
landlord-matt - 3 events, 1 issue


Detailed Description

Elasticsearch Hadoop (ES-Hadoop) is a Java-based integration library that brings Elasticsearch's real-time search and analytics to the Hadoop ecosystem. It integrates with several big data processing frameworks, including Apache Hadoop MapReduce, Apache Hive, and Apache Spark, so that data processing jobs can read from and write to Elasticsearch clusters directly.

The library is designed for minimal dependencies and easy deployment. It ships as a small, self-contained jar of approximately 300 kilobytes with no external dependencies, and it needs only network access to an Elasticsearch cluster over the REST API. Users can add the jar to a job's classpath by bundling it with the job, through the DistributedCache, or by provisioning the cluster manually. The project supports Hadoop 2.x and 3.x on YARN and Spark 3.0 through 3.4, with Scala 2.12 or 2.13 depending on the Spark version. It remains backward compatible with older Elasticsearch versions, though matching the ES-Hadoop and Elasticsearch version numbers is recommended.

For MapReduce environments, the library provides dedicated InputFormat and OutputFormat classes named EsInputFormat and EsOutputFormat that handle reading and writing operations at the low level. For Apache Hive users, ES-Hadoop offers a storage handler that allows definition of external tables backed by Elasticsearch indices, with field mapping to JSON for communication with Elasticsearch. Hive users can read from and write to Elasticsearch using standard SQL-like syntax with configuration through TBLPROPERTIES.
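To make the Hive storage handler description concrete, here is a minimal sketch in Python that only assembles the DDL text for an external table backed by an Elasticsearch index. The handler class name org.elasticsearch.hadoop.hive.EsStorageHandler and the es.resource property come from the ES-Hadoop documentation rather than from this page; the helper name hive_external_table_ddl, the table name, the columns, and the radio/artists index are hypothetical values chosen for illustration.

```python
def hive_external_table_ddl(table, columns, es_resource, extra_props=None):
    """Build a CREATE EXTERNAL TABLE statement backed by an Elasticsearch index.

    Hypothetical helper: it only formats text and does not talk to Hive
    or Elasticsearch.
    """
    # es.resource names the target index; further es.* settings are optional.
    props = {"es.resource": es_resource}
    props.update(extra_props or {})

    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    tbl_props = ", ".join(f"'{k}' = '{v}'" for k, v in props.items())

    return (
        f"CREATE EXTERNAL TABLE {table} ({cols})\n"
        "STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'\n"
        f"TBLPROPERTIES({tbl_props});"
    )


# Illustrative table definition (names and index are assumptions).
ddl = hive_external_table_ddl(
    "artists",
    [("id", "BIGINT"), ("name", "STRING")],
    "radio/artists",
)
print(ddl)
```

The generated statement could then be run in a Hive session that has the ES-Hadoop jar on its classpath; after that, ordinary SELECT and INSERT statements against the table would read from and write to the index.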

Apache Spark integration is particularly comprehensive, providing native Java and Scala support through dedicated RDD classes for reading and methods for writing on any RDD. The library also supports Spark SQL, allowing users to work with Elasticsearch data through SQL queries. Configuration across all frameworks uses properties prefixed with "es", with a reserved "es.internal" namespace for library use.
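The configuration convention above can be sketched as a small pre-flight check, written here in Python. This is not part of ES-Hadoop: validate_es_settings is a hypothetical helper that rejects keys outside the "es." prefix and keys inside the reserved "es.internal" namespace before the settings are handed to a job. The keys es.nodes, es.port, and es.resource are settings from the ES-Hadoop documentation; the values are placeholders.

```python
REQUIRED_PREFIX = "es."
RESERVED_NAMESPACE = "es.internal"


def validate_es_settings(settings):
    """Return the settings as a str -> str dict after checking key names.

    Hypothetical helper: raises ValueError for keys that do not start with
    "es." or that fall inside the reserved "es.internal" namespace.
    """
    checked = {}
    for key, value in settings.items():
        if not key.startswith(REQUIRED_PREFIX):
            raise ValueError(f"not an ES-Hadoop property: {key!r}")
        if key == RESERVED_NAMESPACE or key.startswith(RESERVED_NAMESPACE + "."):
            raise ValueError(f"reserved for library use: {key!r}")
        # Job configurations are usually string-valued, so normalize here.
        checked[key] = str(value)
    return checked


# Placeholder connection settings (values are assumptions).
conf = validate_es_settings(
    {"es.nodes": "localhost", "es.port": 9200, "es.resource": "radio/artists"}
)
print(conf)
```

A dictionary validated this way could be passed as options to whichever framework integration is in use, since all of them share the same property names.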

Across the tracked issues and pull requests, the median response time is 55.9 hours (about 2.3 days), although the mean is 51.5 days and no issue events were indexed in the last 7 days. The primary contributors, jbaiera and masseyke, have each logged 30 events in the tracked activity. Documentation, serialization, and bug labels are the most common issue categories. The codebase is built with Gradle and requires JVM 8 or higher for compilation. The project is released under the Apache License 2.0 and is connected to other Elastic projects, including the main Elasticsearch repository.
