run-llama/liteparse

Description: A fast, helpful, and open-source document parser

Summary Information

Updated 3 hours ago

Added to GitGenius on May 31st, 2026

Created on February 9th, 2026

Open Issues & Pull Requests: 25 (+1)

Number of forks: 788

Total Stargazers: 11,658 (+3)

Total Subscribers: 37 (+0)

Issue Activity (beta)

Open issues: 16

New in 7 days: 6

Closed in 7 days: 3

Avg open age: 14 days

Stale 30+ days: 6

Stale 90+ days: 1

Recent activity

Opened in 7 days: 6

Closed in 7 days: 3

Comments in 7 days: 6

Events in 7 days: 13

Top labels

bug (49)
enhancement (26)

Most active issues this week

#351 [Bug] parse() / isComplex() panic with RuntimeError: unreachable on valid PDFs using CIDFontType2 fonts with a non-Identity CIDToGIDMap stream - 15 events / 7 comments
#346 [Bug] linux-x64-gnu native package requires GLIBC_2.35, breaking AWS Lambda AL2023 - 8 events / 2 comments
#153 [Feature] Rotated page detection - 3 events / 1 comment
#357 [Bug] "unsupported file format" Error when passing file via stdin - 3 events / 2 comments
#345 [Feature] Include complexity signals in parse output - 2 events / 1 comments

Repository Insights (GitGenius)

Median issue/PR response: 0.0 hours

Mean response time: 42.9 hours

90th percentile: 10.9 hours

Tracked items: 104

Most active contributors

logan-markewich - 226 events, 100 issues
AdemBoukhris457 - 24 events, 24 issues
marcosmarf27 - 8 events, 3 issues
Netlance-lux - 7 events, 1 issue
adarshmadrecha - 7 events, 5 issues

Detailed Description

LiteParse is an open-source document parser written in Rust that specializes in fast, lightweight PDF parsing with spatial text extraction. The project is maintained by the LlamaIndex team and provides high-quality text parsing with bounding box information without relying on proprietary language models or cloud dependencies. All processing runs locally on the user's machine, making it suitable for offline and air-gapped environments.
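To make "spatial text extraction with bounding boxes" concrete, here is a minimal sketch of how bounding-box text items could be regrouped into lines. The `TextItem` shape, the top-left coordinate origin, and the `group_into_lines` helper are all assumptions made for illustration; they are not LiteParse's actual output schema or algorithm.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    # Hypothetical shape of one extracted text fragment with its bounding box.
    text: str
    x: float       # left edge
    y: float       # top edge (assumes origin at the top-left of the page)
    width: float
    height: float


def group_into_lines(items, y_tolerance=2.0):
    """Group items whose top edges are within y_tolerance into one line."""
    lines = []
    for item in sorted(items, key=lambda i: (i.y, i.x)):
        # Compare against the first item of the current line to avoid drift.
        if lines and abs(lines[-1][0].y - item.y) <= y_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    # Within each line, order fragments left to right and join with spaces.
    return [
        " ".join(i.text for i in sorted(line, key=lambda i: i.x))
        for line in lines
    ]


items = [
    TextItem("world", 50.0, 10.5, 30.0, 10.0),
    TextItem("Hello", 10.0, 10.0, 30.0, 10.0),
    TextItem("Next", 10.0, 30.0, 30.0, 10.0),
]
lines = group_into_lines(items)
```

Real layouts (multi-column pages, rotated text) need more than a single vertical tolerance, so treat this only as an illustration of why the coordinates matter.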

The core functionality centers on spatial text parsing using PDFium for PDF rendering and text extraction. LiteParse includes a flexible OCR system with Tesseract bundled by default for zero-setup operation, while also supporting integration with HTTP-based OCR servers like EasyOCR and PaddleOCR for users who need higher accuracy. The library provides a standardized OCR API specification that allows integration of custom OCR services.

A distinctive feature is the complexity detection capability, which performs a cheap text-layer-only analysis to determine whether a document needs OCR or heavier processing before committing to a full parse. This allows users to route documents to different pipelines, reject unsuitable documents, or estimate processing costs. The detection identifies specific reasons a page might need OCR, including scanned content, missing text, sparse text, embedded images, garbled text, and vector text issues.
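The routing idea described above can be sketched as follows. This is a hypothetical illustration: the `Reason` names simply mirror the categories listed in the paragraph, and `PageComplexity`, `route`, and the `max_flagged_fraction` threshold are invented for this example rather than taken from the LiteParse API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Reason(Enum):
    # Hypothetical identifiers for the reasons listed in the description.
    SCANNED = "scanned"
    MISSING_TEXT = "missing_text"
    SPARSE_TEXT = "sparse_text"
    EMBEDDED_IMAGES = "embedded_images"
    GARBLED_TEXT = "garbled_text"
    VECTOR_TEXT = "vector_text"


@dataclass
class PageComplexity:
    # Hypothetical per-page result of a cheap text-layer-only check.
    page: int
    reasons: set = field(default_factory=set)


def route(pages, max_flagged_fraction=0.5):
    """Pick a pipeline: 'text', 'ocr', or 'reject'.

    'text'   -> no page was flagged, the text layer is enough
    'ocr'    -> some pages were flagged, but not more than the threshold
    'reject' -> empty document, or too many pages would need heavy processing
    """
    if not pages:
        return "reject"
    flagged = [p for p in pages if p.reasons]
    if not flagged:
        return "text"
    if len(flagged) / len(pages) > max_flagged_fraction:
        return "reject"
    return "ocr"


clean = [PageComplexity(1), PageComplexity(2)]
mixed = [PageComplexity(1), PageComplexity(2, {Reason.SCANNED})]
scanned = [PageComplexity(1, {Reason.SCANNED}), PageComplexity(2, {Reason.MISSING_TEXT})]
decisions = [route(clean), route(mixed), route(scanned)]
```

The count of flagged pages could equally feed a cost estimate instead of a hard reject; the threshold here is arbitrary.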

LiteParse supports multiple output formats including Markdown with reconstructed headings, tables, lists, images, and links, as well as JSON and plain text. The Markdown output is designed specifically for feeding documents into language models and RAG pipelines. The library can also generate high-quality page screenshots for LLM agents to extract visual information that text alone cannot capture.

The project is available across multiple programming languages and platforms. Users can install LiteParse via npm for Node.js and TypeScript, pip for Python, cargo for Rust, or as a WebAssembly package for browser use. The same command-line interface is available across all installations, supporting batch parsing of entire directories, individual file parsing, screenshot generation, and complexity checking.

LiteParse also handles automatic conversion of various document formats to PDF before parsing, including Office documents like Word, PowerPoint, and spreadsheets through LibreOffice integration, and image formats through ImageMagick. This multi-format input support makes it versatile for different document sources.
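As a rough sketch of the convert-to-PDF step, the snippet below picks a converter by file extension and builds a command line without running it. The extension sets and the `conversion_command` helper are assumptions for illustration. The `soffice --headless --convert-to pdf` and `magick <input> <output>` invocations are standard LibreOffice and ImageMagick 7 usage, but how LiteParse itself calls these tools is not stated in the source.

```python
from pathlib import Path

# Assumed extension sets; the formats LiteParse actually accepts may differ.
OFFICE_EXTS = {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def conversion_command(src, out_dir):
    """Return an argv list that converts src to PDF, or None if src is a PDF."""
    src = Path(src)
    out_dir = Path(out_dir)
    ext = src.suffix.lower()
    if ext == ".pdf":
        return None  # already a PDF, nothing to convert
    if ext in OFFICE_EXTS:
        # LibreOffice writes <stem>.pdf into the --outdir directory.
        return ["soffice", "--headless", "--convert-to", "pdf",
                "--outdir", str(out_dir), str(src)]
    if ext in IMAGE_EXTS:
        # ImageMagick 7 infers the output format from the target extension.
        return ["magick", str(src), str(out_dir / (src.stem + ".pdf"))]
    raise ValueError(f"unsupported input format: {ext or '(none)'}")


office_cmd = conversion_command("report.DOCX", "out")
image_cmd = conversion_command("scan.png", "out")
pdf_cmd = conversion_command("paper.pdf", "out")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the corresponding tool is installed.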

According to GitGenius activity tracking, the repository is actively maintained: the median issue and pull request response latency is 0.0 hours and the mean is 42.9 hours across 104 tracked items. The most active contributor is logan-markewich with 226 tracked events, followed by AdemBoukhris457 with 24 events. The most common issue labels are bug (49) and enhancement (26). The repository shares contributors with related projects including datasette, claude-code, and deepagents, which suggests its contributors are also active in the broader ecosystem of document processing and AI tools.
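A median of zero alongside a mean of tens of hours can look contradictory, but it is what a heavily skewed distribution produces. The sketch below uses synthetic latencies, invented purely for illustration and not taken from the repository's data, to show a few slow responses pulling the mean far above the median.

```python
import statistics

# Synthetic response latencies in hours (illustrative only, not real data):
# most items get an immediate response, one waits more than two weeks.
latencies = [0.0] * 7 + [0.5, 1.5, 398.0]

median_hours = statistics.median(latencies)
mean_hours = statistics.mean(latencies)

print(f"median: {median_hours:.1f} h, mean: {mean_hours:.1f} h")
```

With this kind of skew the median describes the typical response, while the mean mostly reflects the slowest few items.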
