LiteParse is an open-source document parsing tool designed for fast, lightweight, and local processing of documents, with a primary focus on PDF files. Its core purpose is to provide high-quality spatial text extraction, including bounding box information, without relying on proprietary large language model (LLM) features or cloud services. Everything runs locally, making it suitable for privacy-sensitive and air-gapped environments.
The tool supports a wide range of input formats, including PDFs, Office documents (Word, PowerPoint, Excel), and images. Office documents are automatically converted to PDF using LibreOffice, while images are converted via ImageMagick, so all supported inputs go through the same parsing pipeline regardless of the original file type. LiteParse’s parsing engine is built on PDFium for fast and accurate text extraction, and it integrates a flexible OCR system for handling scanned or image-based documents. The default OCR engine is Tesseract, which is bundled for zero-setup use, but users can also plug in external OCR servers such as EasyOCR or PaddleOCR via a simple HTTP API specification, trading the zero-setup default for better accuracy or performance where a workload calls for it.
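
The routing described above can be pictured with a small sketch. This is not LiteParse's actual implementation: the function name `conversion_command`, the extension sets, and the assumption that images are also turned into PDF are all hypothetical. Only the `soffice --headless --convert-to pdf --outdir` invocation and ImageMagick's `magick input output` form are standard usage of those tools. The function only builds the command; it does not run it.

```python
from pathlib import Path
from typing import List, Optional

# Illustrative extension sets; LiteParse's real list may differ.
OFFICE_EXTS = {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def conversion_command(src: str, out_dir: str) -> Optional[List[str]]:
    """Return the external command that would convert `src` to PDF,
    or None when `src` is already a PDF and needs no conversion."""
    path = Path(src)
    ext = path.suffix.lower()
    if ext == ".pdf":
        return None
    if ext in OFFICE_EXTS:
        # LibreOffice headless conversion writes <stem>.pdf into out_dir.
        return ["soffice", "--headless", "--convert-to", "pdf",
                "--outdir", out_dir, str(path)]
    if ext in IMAGE_EXTS:
        # ImageMagick 7 entry point; the output format follows the target suffix.
        target = Path(out_dir) / (path.stem + ".pdf")
        return ["magick", str(path), str(target)]
    raise ValueError(f"unsupported input type: {ext or '(none)'}")
```

A caller could pass the returned list to `subprocess.run(cmd, check=True)` and then hand the resulting PDF to the parser. Keeping the decision separate from the execution makes the routing easy to test without LibreOffice or ImageMagick installed.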
LiteParse offers multiple output formats, including structured JSON (with text and bounding boxes) and plain text that preserves layout. Additionally, it can generate high-quality screenshots of document pages, which are particularly useful for LLM agents that require visual context. The tool is multi-platform, supporting Linux, macOS (Intel and ARM), and Windows, and is accessible from Rust, Node.js/TypeScript, Python, and browser environments via WebAssembly (WASM). Language bindings are provided through napi-rs for Node.js, PyO3 for Python, and wasm-bindgen for WASM, ensuring consistent functionality across ecosystems.
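
To show why bounding boxes matter downstream, here is a minimal sketch that selects text inside a rectangular region of a page. The JSON shape used here (`pages`, `items`, `x`, `y`, `width`, `height`, with the origin at the top-left) is an assumption made for illustration, not LiteParse's documented schema; adapt the field names to the real output.

```python
import json

# Hypothetical output shape, for illustration only.
SAMPLE = json.loads("""
{
  "pages": [
    {
      "page": 1,
      "items": [
        {"text": "Invoice", "x": 72,  "y": 50,  "width": 80, "height": 14},
        {"text": "Total",   "x": 72,  "y": 700, "width": 40, "height": 12},
        {"text": "$42.00",  "x": 400, "y": 700, "width": 50, "height": 12}
      ]
    }
  ]
}
""")


def items_in_region(page, x0, y0, x1, y1):
    """Return the text of every item whose box lies fully inside
    the rectangle with corners (x0, y0) and (x1, y1)."""
    hits = []
    for item in page["items"]:
        left, top = item["x"], item["y"]
        right = left + item["width"]
        bottom = top + item["height"]
        if left >= x0 and top >= y0 and right <= x1 and bottom <= y1:
            hits.append(item["text"])
    return hits


# Select everything in a horizontal band near the bottom of the page.
footer_band = items_in_region(SAMPLE["pages"][0], 0, 690, 612, 720)
```

With the sample data, `footer_band` holds the two items in the bottom band and excludes the heading. The same pattern supports cropping headers and footers or pulling a field from a fixed position on a form.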
The CLI, available through npm, pip, or cargo, provides a unified experience for parsing files, batch processing directories, and generating screenshots. Users can specify output formats, target pages, OCR settings (including language and server URL), rendering DPI, and other options. Batch parsing is supported for large-scale document processing, and remote files can be handled by piping their contents to the CLI via standard input. Environment variables such as TESSDATA_PREFIX allow for offline OCR operation by specifying the location of Tesseract language data files.
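
For offline OCR, it helps to confirm that the language data is actually present before disconnecting from the network. The sketch below is a hypothetical helper (`offline_ocr_env` is not part of LiteParse) that checks a directory and builds an environment with `TESSDATA_PREFIX` set. It assumes the variable should point directly at the directory holding the `<lang>.traineddata` files, which is Tesseract's naming convention; whether a given build expects that directory or its parent should be verified.

```python
import os
from pathlib import Path


def offline_ocr_env(tessdata_dir, languages, base_env=None):
    """Return a copy of the environment with TESSDATA_PREFIX set,
    after checking that each language has a .traineddata file."""
    data_dir = Path(tessdata_dir)
    missing = [
        lang for lang in languages
        if not (data_dir / f"{lang}.traineddata").is_file()
    ]
    if missing:
        raise FileNotFoundError(
            f"missing Tesseract data for: {', '.join(missing)} in {data_dir}"
        )
    # Copy so the caller's environment is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["TESSDATA_PREFIX"] = str(data_dir)
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching the CLI, so the setting applies to that one process rather than to the whole shell session.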
For development, LiteParse is structured as a Rust workspace with the core library and separate crates for language-specific bindings. The project leverages several open-source technologies, including PDFium for PDF handling, Tesseract for OCR, EasyOCR and PaddleOCR for optional HTTP OCR, napi-rs and PyO3 for bindings, and wasm-bindgen for WASM support. It is released under the Apache 2.0 license, which permits commercial use, modification, and redistribution.
LiteParse is ideal for users who need fast, reliable, and local document parsing without cloud dependencies. It is particularly well-suited for extracting structured text and layout information from PDFs and other documents, supporting downstream applications such as LLM agents, data pipelines, and document analysis tools. For more complex parsing tasks, such as dense tables or handwritten text, the repository recommends LlamaParse, a cloud-based parser, but LiteParse remains a robust solution for most local parsing needs.