TOPIC_NODE
DIGEST_COUNT: 1
STRUCTURED PDF EXTRACTOR FOR RAG CLAIMS ~300 PAGES/S ON CPU
FIRST_SEEN 2026-01-06
LAST_SYNC 2026-01-06
[ OVERVIEW ]
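A new PDF extractor written in C, with Python bindings, outputs structured JSON (geometry, typography, headings). Its author claims roughly 300 pages/second on CPU, about 30x faster than pymupdf4llm, which would put pymupdf4llm at roughly 10 pages/second in the same comparison. The tool targets high-volume RAG pipelines that need layout-aware chunking. It does not yet support OCR or image extraction, and the speed figures are self-reported: no external benchmarks are provided.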
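To make "layout-aware chunking" concrete, here is a minimal sketch of how structured output of this kind could be grouped into heading-scoped chunks for a RAG index. The block schema (`type`, `text`, `page`) and the function name `chunk_by_heading` are assumptions made for illustration; the source does not describe the extractor's actual JSON layout or API.

```python
from typing import Dict, List, Optional

# Hypothetical block schema, NOT the extractor's documented output:
#   {"type": "heading" | "paragraph", "text": str, "page": int}


def chunk_by_heading(blocks: List[Dict], max_chars: int = 500) -> List[Dict]:
    """Group paragraph blocks under the most recent heading.

    A new chunk starts at every heading, and also when adding the next
    paragraph would push the current chunk past max_chars.
    """
    chunks: List[Dict] = []
    heading: Optional[str] = None
    buf: List[str] = []
    pages: List[int] = []

    def flush() -> None:
        # Emit the buffered paragraphs as one chunk, then reset the buffers.
        if buf:
            chunks.append(
                {
                    "heading": heading,
                    "text": "\n".join(buf),
                    "pages": sorted(set(pages)),
                }
            )
            buf.clear()
            pages.clear()

    for block in blocks:
        if block["type"] == "heading":
            flush()
            heading = block["text"]
        else:
            size = sum(len(t) for t in buf) + len(block["text"])
            if buf and size > max_chars:
                flush()
            buf.append(block["text"])
            pages.append(block["page"])

    flush()
    return chunks
```

Keeping the heading and page numbers on each chunk lets a retriever cite where a passage came from, which is the main benefit of structured output over a flat text dump.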
[ ALL_SOURCES ]
[ STORY_TIMELINE ]
Structured PDF extractor for RAG claims ~300 pages/s on CPU