PageIndex by VectifyAI

Description: 📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG

Summary Information

Updated 19 minutes ago

Added to GitGenius on November 11th, 2025

Created on April 1st, 2025

Open Issues & Pull Requests: 134 (+0)

Number of forks: 2,953

Total Stargazers: 33,909 (+1)

Total Subscribers: 139 (+0)

Issue Activity (beta)

Open issues: 67

New in 7 days: 1

Closed in 7 days: 30

Avg open age: 69 days

Stale 30+ days: 55

Stale 90+ days: 42

Recent activity

Opened in 7 days: 1

Closed in 7 days: 0

Comments in 7 days: 0

Events in 7 days: 0

Top labels

duplicate (1)

Most active issues this week

#34 Can it be used for non structured docs? - 8 events / 1 comment
#13 How to integrate with semantic vector-based search? - 6 events / 1 comment
#237 How to add Self hosted LLMs Like Gemma 4 and Qwen ? - 6 events / 1 comment
#201 Does this support txt documents and Word documents? - 5 events / 1 comment
#16 How to get the contents of the nodes - 4 events / 1 comment

Repository Insights (GitGenius)

Median issue/PR response: 2.8 days

Mean response time: 19.1 days

90th percentile: 47.5 days

Tracked items: 94

Most active contributors

KylinMountain - 75 events, 39 issues
zmtomorrow - 20 events, 12 issues
rejojer - 15 events, 12 issues
BukeLy - 14 events, 7 issues
15846363412 - 4 events, 4 issues

Related by overlapping contributors

Detailed Description

PageIndex is a vectorless, reasoning-based retrieval-augmented generation system designed to overcome the limitations of traditional vector database approaches for professional document analysis. Rather than relying on semantic similarity search through vector embeddings, PageIndex uses large language models to reason over hierarchical tree-structured indexes of documents, enabling context-aware and explainable retrieval that mirrors how human experts navigate complex materials.

The core innovation of PageIndex is its two-step retrieval process. First, it generates a table-of-contents-style tree structure index from long documents, organizing content into natural sections rather than artificial chunks. Second, it performs agentic reasoning-based retrieval through tree search, where LLMs traverse and reason about the document structure to locate relevant information. This approach eliminates the need for vector databases and chunking while providing full traceability and explainability, as every retrieval result is grounded in explicit page and section references.
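The two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PageIndex's actual API: the `Node` schema, the `Judge` type, `keyword_judge`, and `tree_search` are all hypothetical names, and the keyword-overlap judge is only a stand-in for the LLM reasoning step so that the example runs without a model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One entry in a table-of-contents-style index (hypothetical schema)."""
    title: str
    start_page: int
    end_page: int
    summary: str = ""
    children: List["Node"] = field(default_factory=list)


# A judge rates how relevant a node is to a query, from 0.0 to 1.0.
# In a reasoning-based system this would be an LLM call; here it is pluggable.
Judge = Callable[[str, Node], float]


def keyword_judge(query: str, node: Node) -> float:
    """Toy stand-in for LLM reasoning: share of query words present in the node."""
    query_words = set(query.lower().split())
    node_words = set(f"{node.title} {node.summary}".lower().split())
    if not query_words:
        return 0.0
    return len(query_words & node_words) / len(query_words)


def tree_search(root: Node, query: str, judge: Judge) -> List[Node]:
    """Greedy descent: at each level, follow the child the judge rates highest.

    Returns the path from the root to the selected section, so the result
    carries explicit section titles and page ranges.
    """
    path = [root]
    node = root
    while node.children:
        best = max(node.children, key=lambda child: judge(query, child))
        if judge(query, best) <= 0.0:
            break  # no child looks relevant; stop at the current section
        path.append(best)
        node = best
    return path


# Step 1 (assumed already done): a tree index built from a long document.
report = Node("Annual Report", 1, 120, "full report", [
    Node("Business Overview", 1, 30, "products markets strategy"),
    Node("Financial Statements", 31, 100, "income balance sheet cash flow revenue", [
        Node("Income Statement", 31, 50, "revenue expenses net income"),
        Node("Balance Sheet", 51, 70, "assets liabilities equity"),
    ]),
    Node("Risk Factors", 101, 120, "regulatory market risk"),
])

# Step 2: search the tree and report where the answer should be read from.
path = tree_search(report, "net income revenue", keyword_judge)
for section in path:
    print(f"{section.title} (pages {section.start_page}-{section.end_page})")
```

The returned path is what makes the retrieval traceable: each hop names a section and its page range. A real system would likely replace the greedy single-branch descent with a search that can follow several branches, and would replace `keyword_judge` with a model call.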

PageIndex addresses a fundamental problem in RAG systems: the distinction between similarity and relevance. Traditional vector-based RAG prioritizes semantic similarity, which often fails to capture true relevance in professional documents requiring contextual understanding and multi-step reasoning. The system is particularly suited for financial reports, legal documents, regulatory filings, technical manuals, medical literature, academic textbooks, and other long, complex professional documents where domain expertise and contextual reasoning are critical.

The repository provides multiple deployment options. Users can self-host PageIndex locally using the open-source code with standard PDF parsing, integrate it via MCP or API with the cloud service for enhanced OCR and tree building, or deploy enterprise solutions with dedicated or private infrastructure. The project includes practical examples such as an agentic vectorless RAG demonstration using OpenAI Agents SDK, along with Jupyter notebooks for vectorless RAG and vision-based RAG workflows that work directly over PDF page images without OCR.

According to GitGenius tracking data, stargazers increased from 33,725 to 33,727 since early July 2026. Across 94 tracked items, the median issue and pull request response latency is 67.7 hours (about 2.8 days), while the mean is 458.4 hours (about 19.1 days); the gap points to a long tail of slow responses, consistent with the 90th percentile of 47.5 days. The most active contributors tracked by GitGenius are KylinMountain with 75 events, zmtomorrow with 20 events, and rejojer with 15 events. The repository shares overlapping contributors with related projects including langgenius/dify, tracel-ai/burn, and nousresearch/hermes-agent, which suggests ties to a broader ecosystem of AI and agentic systems.

PageIndex reports 98.7 percent accuracy on FinanceBench, a financial document question-answering benchmark, substantially outperforming vector RAG solutions on professional document analysis tasks. The system is available as a ChatGPT-style chat platform at chat.pageindex.ai, with documentation, tutorials, and blog resources at pageindex.ai and docs.pageindex.ai. GitGenius classifies the project across multiple domains, including vector indexing, unstructured data processing, information retrieval, embeddings, vector search, semantic search, document processing, AI frameworks, data management, and knowledge bases.
