Safe, Open, High-Performance: PDF for AI

OpenDataLoader PDF converts PDFs into JSON, Markdown, or HTML, ready to feed into modern AI stacks (LLMs, vector search, and RAG). It reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query. Powered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets. AI safety filtering is enabled by default: it automatically filters likely prompt-injection content embedded in PDFs to reduce downstream risk.

Overview

Integration details

Class              | Package                     | Local | Serializable | JS support
OpenDataLoader PDF | langchain-opendataloader-pdf |       |              |

Loader features

Source                  | Document Lazy Loading | Native Async Support
OpenDataLoaderPDFLoader |                       |
The OpenDataLoaderPDFLoader component enables you to parse PDFs into structured Document objects.

Requirements

  • Python >= 3.9
  • Java 11 or newer available on the system PATH
  • opendataloader-pdf >= 1.1.1
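
The first two requirements above can be checked before installing anything. The following is a minimal sketch, not part of the package: the helper names (`python_ok`, `parse_java_major`, `installed_java_major`, `check_requirements`) are hypothetical, and the version parsing assumes the usual `java -version` output shape. It does not check the installed `opendataloader-pdf` version.

```python
import re
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 9)    # from the Requirements list
MIN_JAVA_MAJOR = 11    # from the Requirements list


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the interpreter meets the minimum (major, minor)."""
    return tuple(version_info[:2]) >= minimum


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None.

    Legacy releases report e.g. "1.8.0_281" (major 8); newer ones "11.0.2".
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Return the major version of the `java` found on PATH, or None."""
    java = shutil.which("java")
    if java is None:
        return None
    try:
        result = subprocess.run(
            [java, "-version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return None
    # `java -version` conventionally prints to stderr; check both streams.
    return parse_java_major(result.stderr + result.stdout)


def check_requirements():
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if not python_ok():
        problems.append("Python 3.9 or newer is required")
    major = installed_java_major()
    if major is None:
        problems.append("No usable `java` found on PATH")
    elif major < MIN_JAVA_MAJOR:
        problems.append(f"Java {major} found; Java 11 or newer is required")
    return problems
```

Calling `check_requirements()` returns an empty list when both checks pass, so it can serve as a guard at the top of a setup script.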

Installation

pip install -U langchain-opendataloader-pdf

Quick start

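from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# file_path accepts individual PDF files and directories in the same list.
loader = OpenDataLoaderPDFLoader(
    file_path=["path/to/document.pdf", "path/to/folder"],
    format="text",
)
documents = loader.load()

# Print each document's metadata and the first 80 characters of its content.
for doc in documents:
    print(doc.metadata, doc.page_content[:80])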

Parameters

Parameter          | Type                | Required | Default | Description
file_path          | List[str]           | Yes      |         | One or more PDF file paths or directories to process.
format             | str                 | No       | None    | Output formats (e.g. "json", "html", "markdown", "text").
quiet              | bool                | No       | False   | Suppresses CLI logging output when True.
content_safety_off | Optional[List[str]] | No       | None    | List of content safety filters to disable (e.g. "all", "hidden-text", "off-page", "tiny", "hidden-ocg").
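
The parameters above can be assembled and sanity-checked before the loader is constructed. This is a minimal sketch under stated assumptions: `build_loader_kwargs` and `make_loader` are hypothetical helpers, not part of the package, and the allowed values are copied from the "e.g." lists in the table, so the real library may accept more than this sketch permits.

```python
from typing import List, Optional

# Assumption: these sets mirror the examples in the parameter table only.
KNOWN_FORMATS = {"json", "html", "markdown", "text"}
KNOWN_SAFETY_FILTERS = {"all", "hidden-text", "off-page", "tiny", "hidden-ocg"}


def build_loader_kwargs(
    file_path: List[str],
    format: Optional[str] = None,
    quiet: bool = False,
    content_safety_off: Optional[List[str]] = None,
) -> dict:
    """Validate the documented parameters and return them as keyword arguments."""
    if isinstance(file_path, str):
        file_path = [file_path]  # tolerate a single path passed as a string
    if not file_path:
        raise ValueError("file_path must contain at least one path")
    if format is not None and format not in KNOWN_FORMATS:
        raise ValueError(f"unrecognized format: {format!r}")
    unknown = set(content_safety_off or []) - KNOWN_SAFETY_FILTERS
    if unknown:
        raise ValueError(f"unrecognized safety filters: {sorted(unknown)}")

    kwargs = {"file_path": list(file_path), "quiet": quiet}
    # Omit optional values that were not set so the loader's defaults apply.
    if format is not None:
        kwargs["format"] = format
    if content_safety_off:
        kwargs["content_safety_off"] = list(content_safety_off)
    return kwargs


def make_loader(**params):
    """Construct the loader; imported lazily so validation works without it."""
    from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

    return OpenDataLoaderPDFLoader(**build_loader_kwargs(**params))
```

For example, `make_loader(file_path=["path/to/folder"], format="markdown", quiet=True, content_safety_off=["hidden-text"])` would request Markdown output with CLI logging suppressed and only the hidden-text filter disabled. Disabling filters weakens the default prompt-injection protection, so it is worth limiting to documents you trust.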

Additional Resources

I