Description: 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper. Don't be shy, join here: https://discord.gg/jP8KfhDhyN
View unclecode/crawl4ai on GitHub ↗
Crawl4AI is a powerful and flexible open-source web crawling and data extraction framework specifically designed for building large-scale datasets for Artificial Intelligence and Machine Learning applications. Developed by UncleCode, it distinguishes itself from general-purpose crawlers by prioritizing data quality, structured output, and ease of integration with AI/ML pipelines. Instead of simply fetching HTML, Crawl4AI focuses on extracting specific data points based on user-defined schemas, making it ideal for tasks like training language models, building knowledge graphs, or creating datasets for computer vision.
The core of Crawl4AI revolves around a declarative configuration system. Users define *crawlers* using YAML files, specifying the starting URLs, the rules for navigating the website (following links, handling pagination), and most importantly, the *extractors*. Extractors are defined using CSS selectors or XPath expressions to pinpoint the desired data elements within the HTML structure. This declarative approach significantly simplifies the crawling process, allowing users to focus on defining *what* data they need rather than *how* to retrieve it. The framework handles the complexities of request scheduling, rate limiting, and error handling.
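To make the declarative style concrete, here is a hypothetical crawler definition. The field names (`start_urls`, `follow`, `extractors`, and so on) are illustrative assumptions for this sketch, not Crawl4AI's documented schema:

```yaml
# Hypothetical crawler definition -- field names are illustrative,
# not taken from Crawl4AI's documented configuration schema.
name: books_crawler
start_urls:
  - https://example.com/catalog
follow:
  link_selector: "a.next-page"     # pagination rule
  max_depth: 3
extractors:
  - name: book
    selector: "article.product"    # one record per matching element
    fields:
      title:
        css: "h3 a::attr(title)"
      price:
        css: "p.price::text"
      detail_url:
        xpath: ".//h3/a/@href"
output:
  format: jsonl
  path: books.jsonl
```

The point of such a file is that it states *what* to extract (selectors and fields) while the framework supplies the *how* (scheduling, retries, rate limiting).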
A key feature is the support for multiple output formats. Crawl4AI can output extracted data in JSON, JSON Lines (JSONL), CSV, and Parquet formats, catering to various downstream processing requirements. The JSONL format is particularly well-suited for large datasets as it allows for streaming data without needing to load the entire dataset into memory. Furthermore, it integrates seamlessly with popular data storage solutions like Amazon S3, Google Cloud Storage, and Azure Blob Storage, enabling scalable data warehousing. The framework also supports data filtering and transformation during the extraction process, allowing for basic data cleaning and preprocessing.
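The streaming advantage of JSONL can be shown with a few lines of plain standard-library Python. This is generic code, not Crawl4AI's own writer: each line is an independent JSON document, so a consumer can process records one at a time without loading the whole dataset.

```python
import json

# Write records one per line: each line is a self-contained JSON document.
records = [
    {"title": "Book A", "price": 12.5},
    {"title": "Book B", "price": 8.0},
]
with open("books.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read back lazily, one record at a time, with constant memory use.
def iter_jsonl(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

total = sum(rec["price"] for rec in iter_jsonl("books.jsonl"))
print(total)  # 20.5
```

The same line-at-a-time property is what makes JSONL friendly to appends from a long-running crawl and to object stores like S3, where files are typically written and read sequentially.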
Crawl4AI is built with scalability in mind. It leverages asynchronous programming (using asyncio in Python) to handle a large number of concurrent requests efficiently. It also supports distributed crawling, allowing users to run multiple crawler instances across different machines to accelerate data collection. The framework provides built-in mechanisms for managing proxies, handling cookies, and respecting robots.txt, ensuring responsible and ethical web crawling. It also includes features for detecting and handling dynamic content loaded via JavaScript, though this often requires driving a headless browser through a tool like Puppeteer or Playwright.
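The asyncio concurrency pattern described above can be sketched in a few lines. This is a generic illustration of semaphore-bounded concurrent fetching, not Crawl4AI's actual scheduler; `fetch()` is a stand-in for a real HTTP request (e.g. via aiohttp or a headless browser):

```python
import asyncio

# Stand-in for a real HTTP request; sleeps to simulate network latency.
async def fetch(url: str) -> str:
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

async def crawl(urls, max_concurrency: int = 5):
    # The semaphore caps how many requests are in flight at once,
    # acting as a simple rate limiter.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with sem:  # at most max_concurrency concurrent fetches
            return await fetch(url)

    # gather() runs all tasks concurrently and preserves input order.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))

pages = asyncio.run(crawl([f"https://example.com/{i}" for i in range(20)]))
print(len(pages))  # 20
```

Because `asyncio.gather` preserves order, each result lines up with its input URL, which keeps downstream record-writing straightforward even though requests complete out of order.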
Beyond the core crawling functionality, Crawl4AI offers several useful extensions and tools. These include a command-line interface (CLI) for managing crawlers, a web UI for monitoring crawling progress, and a set of pre-built crawlers for popular websites. The project is actively maintained and has a growing community, providing support and contributing new features. In essence, Crawl4AI provides a robust and efficient solution for building high-quality, structured datasets from the web, specifically tailored for the demands of modern AI and Machine Learning projects.