olmocr
by
allenai

Description: Toolkit for linearizing PDFs for LLM datasets/training

View allenai/olmocr on GitHub ↗

Summary Information

Updated 1 hour ago
Added to GitGenius on October 30th, 2025
Created on September 17th, 2024
Open Issues/Pull Requests: 59 (+0)
Number of forks: 1,350
Total Stargazers: 16,938 (+0)
Total Subscribers: 96 (+0)
Detailed Description

The `allenai/olmocr` repository provides olmOCR, a toolkit from the Allen Institute for AI (Ai2) for linearizing PDFs into clean, naturally ordered plain text for building LLM datasets and training corpora. It targets the long-standing weaknesses of traditional OCR and text-extraction pipelines on complex, visually rich documents, such as academic papers, financial reports, and scanned archives, whose multi-column layouts, embedded tables, figures, equations, and diverse font styles tend to scramble naive extraction.

At its heart, olmOCR drives a vision-language model rather than a conventional character-recognition pipeline. Conventional OCR recognizes characters first and then tries to infer structure afterwards; olmOCR instead prompts the model with each rasterized page image together with text blocks and their positions extracted from the PDF's internal structure, an approach the project calls document anchoring. Grounding the model in both the page pixels and the born-digital text lets it emit content in natural reading order and recover structure such as tables, equations, and multi-column layouts far more reliably.
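One way to combine the visual and textual signals is to prompt a vision-language model with both the rasterized page and text anchors pulled from the PDF itself. A minimal sketch of that idea follows; the function name, message layout, and anchor format are illustrative assumptions modeled on common VLM chat APIs, not olmocr's actual prompt code.

```python
import base64


def build_anchored_prompt(page_image_png: bytes, text_blocks: list) -> list:
    """Pair a rasterized page image with text anchors (text plus x/y
    position) extracted from the PDF's own structure. Hypothetical
    sketch of the anchoring idea, not olmocr's real prompt format."""
    # Anchors: born-digital text with coordinates, e.g. from a PDF parser.
    anchor_lines = [
        f"[{b['x']:.0f},{b['y']:.0f}] {b['text']}" for b in text_blocks
    ]
    image_b64 = base64.b64encode(page_image_png).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this page to plain text in natural "
                         "reading order. Raw text hints:\n"
                         + "\n".join(anchor_lines)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ]


blocks = [{"x": 72, "y": 700, "text": "Introduction"},
          {"x": 72, "y": 680, "text": "PDFs are hard to parse."}]
prompt = build_anchored_prompt(b"\x89PNG...", blocks)
print(prompt[0]["content"][0]["text"])
```

The anchors give the model a cheap, reliable transcription of any born-digital text, so its visual attention can focus on ordering, tables, and regions where the embedded text is missing or wrong.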

The released checkpoints are fine-tunes of open vision-language models (the initial olmOCR-7B preview builds on Qwen2-VL-7B-Instruct), trained on a curated corpus of roughly a quarter-million PDF pages drawn from a wide variety of real-world documents. This fine-tuning teaches the model to turn anchored page inputs into faithful plain-text transcriptions across many layouts and content types, making it robust on new, unseen documents.

Key features of olmOCR include a batch inference pipeline that serves the model with an efficient engine and fans work out over large collections of PDFs, structured JSONL output ready for dataset construction, and support for documents with diverse and challenging layouts. The repository provides code for training, inference, and evaluation, along with utilities for preparing datasets and running experiments, making it accessible to researchers and developers and fostering further work in document AI.

The practical implications of olmOCR are substantial. By accurately extracting text from complex documents, it unlocks content that was previously trapped inside PDFs. In digital libraries and archives, it enables more precise indexing and search. For business process automation, it can streamline data entry and information extraction from invoices, contracts, and forms. In research, it facilitates automated analysis of large corpora of scientific literature, and for LLM builders it turns raw PDF collections into clean training data. olmOCR represents a significant step toward more intelligent and autonomous document understanding, pushing the boundaries of automated document processing.

