langextract by google

Description: A Python library for extracting structured information from unstructured text using LLMs with precise source grounding and interactive visualization.

Summary Information

Updated 56 minutes ago

Added to GitGenius on December 22nd, 2025

Created on July 8th, 2025

Open Issues & Pull Requests: 107 (+0)

Number of forks: 2,563

Total Stargazers: 37,123 (+0)

Total Subscribers: 163 (+0)

Issue Activity (beta)

Open issues: 71

New in 7 days: 0

Closed in 7 days: 0

Avg open age: 173 days

Stale 30+ days: 68

Stale 90+ days: 66

Recent activity

Opened in 7 days: 0

Closed in 7 days: 0

Comments in 7 days: 0

Events in 7 days: 0

Top labels

discussion (13)
alternative-llm (11)
enhancement (11)
bug (7)
plugin (5)
documentation (2)
more details required (2)
question (2)

Most active issues this week

No issue events were indexed in the last 7 days.

Repository Insights (GitGenius)

Median issue/PR response: 11.2 hours

Mean response time: 6.9 days

90th percentile: 21.3 days

Tracked items: 174

Most active contributors

aksg87 - 300 events, 126 issues
the-vampiire - 11 events, 5 issues
JustStas - 10 events, 2 issues
python26 - 7 events, 4 issues
RobertGWolf - 6 events, 2 issues

Detailed Description

LangExtract is a Python library developed by Google that extracts structured information from unstructured text using large language models. It is designed to convert raw text, such as clinical notes or reports, into organized, schema-compliant data while keeping every result traceable to the source material.

The core functionality centers on source grounding, which maps every extracted piece of information to its exact location in the original text. This capability enables visual highlighting and verification, allowing users to trace where each extracted data point originated. The library enforces consistent output schemas based on user-provided examples and leverages controlled generation in supported models like Gemini to guarantee structured results that conform to specified formats.
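
To make the idea of source grounding concrete, here is a minimal sketch that maps extracted strings to character offsets in the original text. It is an illustration of the concept only, not LangExtract's implementation or API: the `GroundedSpan` class and `ground` function are hypothetical names, and exact substring matching is an assumed simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedSpan:
    """An extracted string plus its character offsets in the source text."""
    text: str
    start: Optional[int]  # inclusive offset, None when the text is not found
    end: Optional[int]    # exclusive offset, None when the text is not found


def ground(source: str, extracted: list[str]) -> list[GroundedSpan]:
    """Locate each extracted string in the source (hypothetical helper)."""
    spans = []
    cursor = 0
    for item in extracted:
        # Search from the cursor first so repeated phrases map to successive occurrences.
        pos = source.find(item, cursor)
        if pos == -1:
            pos = source.find(item)  # fall back to anywhere in the text
        if pos == -1:
            spans.append(GroundedSpan(item, None, None))
            continue
        spans.append(GroundedSpan(item, pos, pos + len(item)))
        cursor = pos + len(item)
    return spans


note = "Patient takes aspirin 81 mg daily and metformin 500 mg twice daily."
spans = ground(note, ["aspirin", "81 mg", "metformin", "ibuprofen"])
for span in spans:
    print(span)
```

Slicing the source with a span's offsets, as in `note[span.start:span.end]`, returns the extracted text, which is what makes highlighting and verification possible. An item with no offsets, such as "ibuprofen" here, has no support in the source.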

LangExtract handles long documents by combining text chunking, parallel processing, and multiple extraction passes, which improves recall on large inputs. It generates interactive, self-contained HTML visualizations in which users can review thousands of extracted entities in their original context to explore and validate results. The project's examples demonstrate this at scale by extracting hundreds of entities from full novels.
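
The long-document strategy can be sketched with the standard library alone. The example below is a hedged illustration, not LangExtract's code: `chunk_text`, `stub_extract`, and `extract_long` are hypothetical names, and a regular expression that finds capitalized words stands in for the model call.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text: str, max_chars: int) -> list[tuple[int, str]]:
    """Split text into (offset, chunk) pairs, breaking at whitespace when possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1  # keep the space with the earlier chunk
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def stub_extract(chunk: str) -> list[tuple[int, str]]:
    """Stand-in for a model call: (offset_in_chunk, word) for capitalized words."""
    return [(m.start(), m.group()) for m in re.finditer(r"\b[A-Z][a-z]+\b", chunk)]


def extract_long(text: str, max_chars: int = 40, passes: int = 2,
                 max_workers: int = 4) -> list[tuple[int, str]]:
    """Run several passes over all chunks in parallel and merge the results."""
    chunks = chunk_text(text, max_chars)
    found: set[tuple[int, str]] = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _pass in range(passes):
            hits_per_chunk = pool.map(lambda c: stub_extract(c[1]), chunks)
            for (base, _chunk), hits in zip(chunks, hits_per_chunk):
                for offset, word in hits:
                    # Convert chunk-relative offsets back to document offsets.
                    found.add((base + offset, word))
    return sorted(found)


story = ("Romeo met Juliet in Verona. Later, Mercutio and Tybalt "
         "fought in the square while Benvolio watched.")
results = extract_long(story)
print(results)
```

Because the stub is deterministic, the second pass adds nothing new here. With a real model, outputs can vary between passes, so taking the union of passes is one way to raise recall.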

The library supports flexible LLM integration across multiple providers. It works with cloud-based models, including the Google Gemini family and OpenAI models, and with local open-source models through a built-in Ollama interface. The provider system uses a lightweight plugin architecture, so custom LLM providers can be added without modifying core code. Users can register new providers via decorators and distribute them as separate Python packages, which keeps custom dependencies isolated.
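
A decorator-based provider registry of the kind described above might look like the following sketch. Everything in it is hypothetical and chosen for illustration: the `register` and `resolve` functions, the pattern-plus-priority scheme, and the provider class names are assumptions rather than LangExtract's actual plugin API.

```python
import re
from typing import Callable

# Registry of (compiled pattern, priority, provider class); illustrative only.
_PROVIDERS: list[tuple[re.Pattern, int, type]] = []


def register(pattern: str, priority: int = 0) -> Callable[[type], type]:
    """Class decorator that registers a provider for model IDs matching pattern."""
    def decorator(cls: type) -> type:
        _PROVIDERS.append((re.compile(pattern), priority, cls))
        return cls
    return decorator


def resolve(model_id: str) -> type:
    """Return the highest-priority provider class whose pattern matches model_id."""
    matches = [(prio, cls) for pat, prio, cls in _PROVIDERS if pat.search(model_id)]
    if not matches:
        raise ValueError(f"no provider registered for {model_id!r}")
    return max(matches, key=lambda m: m[0])[1]


@register(r"^gemini")
class GeminiProvider:
    """Placeholder for a built-in cloud provider."""


@register(r"^gpt-")
class OpenAIProvider:
    """Placeholder for a second built-in provider."""


@register(r"^gemini-custom", priority=10)
class MyCustomProvider:
    """Placeholder for a provider shipped in a separate package."""


print(resolve("gemini-2.5-flash").__name__)
print(resolve("gemini-custom-v1").__name__)
```

The core code only ever calls `resolve`, so a third-party package can add a provider by importing `register` and decorating its own class. The priority value lets a more specific plugin override a broader built-in pattern.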

LangExtract requires minimal setup. Users define an extraction task by writing a prompt that describes what to extract and by providing high-quality examples to guide model behavior. The library automatically detects when an LLM extracts content from the few-shot examples rather than from the input text, and it flags these ungrounded extractions so users can filter them out. By default, it also raises prompt alignment warnings when examples don't follow best practices, which helps users fix problems before they affect results.
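
One way to picture the ungrounded-extraction check is the sketch below. It assumes that "grounded" means the extracted string occurs verbatim in the input text, which is a simplification, and the `Checked` class and `check_extractions` function are hypothetical names, not part of the library.

```python
from dataclasses import dataclass


@dataclass
class Checked:
    text: str
    grounded: bool             # found verbatim in the input text
    leaked_from_example: bool  # absent from the input but present in a few-shot example


def check_extractions(input_text: str, example_texts: list[str],
                      extractions: list[str]) -> list[Checked]:
    """Flag extractions that are not supported by the input text (illustrative)."""
    results = []
    for item in extractions:
        grounded = item in input_text
        leaked = (not grounded) and any(item in ex for ex in example_texts)
        results.append(Checked(item, grounded, leaked))
    return results


example_texts = ["ROMEO. But soft! What light through yonder window breaks?"]
input_text = "Lady Juliet gazed longingly at the stars."
model_output = ["Lady Juliet", "ROMEO", "the moon"]

checked = check_extractions(input_text, example_texts, model_output)
kept = [c.text for c in checked if c.grounded]
print(kept)
```

Here "ROMEO" is flagged as leaked from the example and "the moon" as unsupported by any text, so only "Lady Juliet" is kept.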

The repository has more than 37,000 stargazers (37,123) as of the most recent tracking period. According to activity metrics, the most active contributor is aksg87 with 300 tracked events, followed by the-vampiire with 11 events and JustStas with 10 events. The median issue and pull request response time across 174 tracked items is 11.2 hours, while the mean is 166.6 hours (about 6.9 days) and the 90th percentile is 21.3 days, which suggests that a minority of slow responses pulls the average up. The most common issue labels are discussion with 13 items, alternative-llm with 11 items, and enhancement with 11 items, indicating ongoing community engagement around extending functionality and exploring alternative language model integrations.
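
The gap between the median and the mean is easier to see with a small calculation. The first line converts the reported mean from hours to days. The list of latencies is synthetic, invented only to show how a single slow response skews the mean, and is not GitGenius data.

```python
from statistics import mean, median

# Reported mean response time, converted from hours to days.
mean_days = round(166.6 / 24, 1)
print(mean_days)

# Synthetic response latencies in hours (made up for illustration only).
sample_hours = [2, 5, 11, 12, 800]
sample_median = median(sample_hours)
sample_mean = mean(sample_hours)
print(sample_median, sample_mean)
```

In the synthetic sample, four of five responses arrive within half a day, yet one 800-hour outlier lifts the mean far above the median. The same pattern would be consistent with the reported figures.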

The library supports enterprise-scale operations through Vertex AI Batch API integration, allowing users to reduce costs on large-scale extraction tasks. API key setup is straightforward, supporting environment variables, .env files, and Vertex AI service accounts for authentication with cloud models. The library is available on PyPI for standard installation and can also be installed from source or via Docker, making it accessible across different deployment scenarios.
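
The lookup order for an API key, environment variable first and then a .env file, can be sketched as follows. The helper names `parse_dotenv` and `resolve_api_key` are hypothetical, and the variable name `LANGEXTRACT_API_KEY` is an assumption that should be confirmed against the project's current documentation. The library's own key handling may differ.

```python
import os
from pathlib import Path
from typing import Optional


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_api_key(env_var: str = "LANGEXTRACT_API_KEY",
                    dotenv_path: str = ".env") -> Optional[str]:
    """Return the key from the environment, else from a .env file, else None."""
    from_env = os.environ.get(env_var)
    if from_env:
        return from_env
    path = Path(dotenv_path)
    if path.is_file():
        return parse_dotenv(path.read_text()).get(env_var)
    return None


key = resolve_api_key()
print("API key found" if key else "No API key configured")
```

Checking the environment first lets a deployment override a key committed to a local .env file without editing it. Service-account authentication for Vertex AI is a separate mechanism and is not covered by this sketch.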
