docling
Docling is an open-source, C-based PDF extractor with Python bindings that converts documents into structured JSON for high-throughput retrieval-augmented generation (RAG) pipelines. It is aimed at developers who need fast, layout-aware extraction of text and metadata from large volumes of PDFs.