Parsing PDFs for text and structure has traditionally been challenging, but recent advances in open-source tools and AI models are making it easier to extract all information – including structured elements like tables and charts – from digitally-generated PDFs.
> [!NOTE]
> Parsing PDFs as preparation for an [[natural language processing|NLP]] workflow that involves an [[large language model|LLM]] may not actually be necessary, depending on your use case. It may be better to convert the PDF to images and let a VLM reason over them one page at a time.
For simple text extraction tasks, use one of the free [[#basic Python libraries]] (`pdfminer.six`, `PyMuPDF`, or `pypdf`) or try one of the [[#other services]], many of which have a free web-based converter to test on your document. Also check out [Apache Tika](https://tika.apache.org/) if you need to convert a wide variety of formats; it's no longer best in class, but it is versatile.
The best paid service for human-readable output (in my opinion) is [Mistral AI's OCR](https://mistral.ai/news/mistral-ocr) at ~$1 per 1000 pages. For layout- and table-aware parsing (important for LLM pipelines), use [Unstructured](https://unstructured.io/) or MinerU.
For structured outputs, consider [NuExtract](https://nuextract.ai/).
For full service extraction on a large number of PDFs, build a pipeline yourself with a combination of text extraction, [[#layout recognition]], and OCR technologies or use an [[#enterprise cloud services|enterprise cloud service]].
When evaluating options, consider these factors:
- Output: do you want a human-readable output like plain text or Markdown?
- Structure: do you need to extract structured elements like tables and charts?
- Integrations: do you need to integrate with a Python workflow? Is this part of an LLM workflow (e.g., RAG pipeline)?
- Cost: what are you willing to pay?
- Speed: how fast do documents need to be processed?
- Privacy: do you need to run on-prem or with assurances your data won't be stored?
## basic Python libraries
The basic [[Python]] libraries for extracting text from PDF are [pdfminer.six](https://pdfminersix.readthedocs.io/en/latest/tutorial/highlevel.html), `PyMuPDF` (also called `fitz`), and `pypdf`. They have limited awareness of layout or structure – they may return text as scattered lines or paragraphs without preserving tables or reading order. They’re fast and useful as a first step, but by themselves they often miss complex structure (e.g. they won’t reconstruct a table or identify a chart).
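A minimal sketch of plain-text extraction with two of these libraries (`report.pdf` is a placeholder path):

```python
from pdfminer.high_level import extract_text
from pypdf import PdfReader

# pdfminer.six: one call returns the whole document as a single string
text = extract_text("report.pdf")

# pypdf: iterate pages and collect each page's text
reader = PdfReader("report.pdf")
pages = [page.extract_text() or "" for page in reader.pages]
```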
[Camelot](https://camelot-py.readthedocs.io/en/master/) and [Tabula](https://github.com/tabulapdf/tabula) are popular choices for extracting tables into structured formats. [PDFplumber](https://github.com/jsvine/pdfplumber) can extract both text and tables; it has a helpful visual debugger to adjust the extraction of tables but requires table-by-table manual adjustment. [PDFtables](https://pdftables.com/) is an online service that provides an API for converting PDF tables to different formats.
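A minimal sketch of table extraction with pdfplumber and Camelot (`report.pdf` and the page number are placeholders; Camelot's default lattice mode also needs Ghostscript installed):

```python
import camelot
import pdfplumber

# pdfplumber: pull the first table on page 1 as a list of rows
with pdfplumber.open("report.pdf") as pdf:
    rows = pdf.pages[0].extract_table()

# Camelot: detect tables on page 1 and expose each as a pandas DataFrame
tables = camelot.read_pdf("report.pdf", pages="1")
df = tables[0].df
```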
==[Docling](https://github.com/docling-project/docling)== is another open-source option worth evaluating; it converts PDFs into Markdown or JSON with layout and table awareness.
## other services
[Unstructured](https://unstructured.io/) is focused on turning documents (PDF, Word, images, etc.) into structured, “AI-ready” data. With one function call, it partitions a PDF into a list of elements, each tagged with its type (paragraph, heading, list, table, etc.) and metadata.
They have an [open source Python library](https://docs.unstructured.io/open-source/introduction/quick-start) with limited capability, useful for testing on your documents. Advanced capabilities are available through their [API](https://unstructured.io/api-key) (which starts at $500/mo). Unstructured can preserve table layouts by using a high-resolution strategy: setting `strategy="hi_res"` applies computer vision and OCR to detect tables and returns each table both as extracted text and as an HTML snippet that preserves structure.
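A minimal sketch with the open-source library (`report.pdf` is a placeholder path; the `hi_res` strategy needs the system dependencies listed in the [[#Unstructured]] implementation notes below):

```python
from unstructured.partition.pdf import partition_pdf

# hi_res runs layout detection + OCR; infer_table_structure adds HTML for tables
elements = partition_pdf(
    filename="report.pdf",
    strategy="hi_res",
    infer_table_structure=True,
)

for el in elements:
    if el.category == "Table":
        print(el.metadata.text_as_html)  # table structure preserved as HTML
    else:
        print(el.category, el.text[:80])
```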
[PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit) (released 2024) is a powerful Python toolkit that combines state-of-the-art models for different tasks. It uses deep learning for layout detection (fine-tuned models like LayoutLMv3 and custom YOLO-based detectors for page elements), integrates **OCR** (PaddleOCR) for text, includes a **table recognition** module that converts detected table images into structured HTML/Markdown (using models like TableMaster or their newer StructEqTable), and even ships specialized models for **math formulas** (detecting and converting equations).
The sister project [MinerU](https://github.com/opendatalab/MinerU) converts PDFs into machine-readable formats (e.g., markdown, JSON), allowing for easy extraction into any format. Test it on the free online demo [here](https://huggingface.co/spaces/opendatalab/MinerU). If you prefer a GUI approach, use the [desktop client](https://mineru.net/).
[LLMWhisperer](https://unstract.com/llmwhisperer/) (by [Unstract](https://unstract.com/)) extracts text in a layout-preserving way (including tables drawn with spaces) so that if you feed it to a large language model, the LLM can interpret the table structure from the text directly. With 100 pages per day in the free tier, it is an affordable option for small projects. Try it out [here](https://pg.llmwhisperer.unstract.com/). You'll see the output is not as well formatted as MinerU's, but it will be sufficient for most RAG systems.
[NuExtract](https://nuextract.ai/) (by NuMind) requires a template (a JSON document with the desired fields and data types) but performs very well compared to frontier models at returning exactly the information requested, responding "I don't know" when a field is not found. Use the API via the [Python SDK](https://github.com/numindai/nuextract-platform-sdk). Pricing is token-based and fairly low (a few cents per 100 pages).
## enterprise cloud services
Enterprise cloud services offer ready-to-use APIs that identify structure automatically and return rich data – often the fastest path to getting every bit of information from a PDF without building your own ML pipeline. These services are generally pay-as-you-go, so you can process documents on demand and scale up as needed, paying per page or per character (analogous to per token) rather than any upfront license.
- [Adobe PDF Extract API](https://developer.adobe.com/document-services/docs/overview/pdf-extract-api/): Use the [Python SDK](https://developer.adobe.com/document-services/docs/overview/pdf-extract-api/gettingstarted/#python) and start with the [Free Tier](https://acrobatservices.adobe.com/dc-integration-creation-app-cdn/main.html?api=pdf-extract-api), which includes 500 free Document Transactions per month.
- [AWS Textract](https://aws.amazon.com/textract/): As part of the [AWS Free Tier](https://aws.amazon.com/free/), you can get started with Amazon Textract for free and analyze up to 100 pages per month (see the `boto3` sketch after this list).
- [Google Cloud Document AI](https://cloud.google.com/document-ai): New customers get $300 in free credit to try Document AI and other Google Cloud products.
- [Azure Document Intelligence](https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence): see this [tutorial](https://www.elastic.co/search-labs/blog/azure-ai%E2%80%93document-intelligence-parse-pdf-text-tables) from Elasticsearch.
- [Mistral OCR](https://mistral.ai/news/mistral-ocr) (API by Mistral AI): Mistral OCR is an ideal model to use in combination with a RAG system taking multimodal documents (such as slides or complex PDFs) as input. The cost is $1/1000 pages.
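As one example of these APIs, a minimal `boto3` sketch for Textract (assumes AWS credentials are configured; `page_1.png` is a placeholder for a single rendered page, since multi-page PDFs go through the asynchronous `StartDocumentAnalysis` API):

```python
import boto3

textract = boto3.client("textract")

with open("page_1.png", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES"],  # table structure in addition to raw text
    )

# The response is a flat list of Blocks (PAGE, LINE, WORD, TABLE, CELL, ...)
lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
```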
## layout recognition
For more complex workflows, a first step is to label the elements in the layout of the PDF, for example each image, table and text area. These labels can be used to train or fine-tune models for your corpus. Most of the enterprise cloud services do this as part of their processing pipeline. If you're creating your own pipeline, try these libraries.
The **LayoutLM** series (v1, v2, v3) by Microsoft is a prime example – these models are pre-trained on annotated PDFs to understand spatial layout along with text, useful for tasks like form understanding or table structure recognition. Open-source implementations of LayoutLM and related models are available via Hugging Face and can be fine-tuned for custom document types. Another example is **Donut (Document Understanding Transformer)** by NAVER, an OCR-free model that can directly generate structured output from document images. These models require more ML expertise to use effectively (often you fine-tune them on labeled data, e.g. bounding boxes or structured labels), but they represent the cutting edge in extracting structured data from visually rich documents.
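A minimal sketch of running a LayoutLMv3 checkpoint from Hugging Face over one rendered page (assumes `transformers`, `Pillow`, and `pytesseract` are installed; `page_1.png` and `num_labels=5` are placeholders, and the base checkpoint's classification head is untrained until you fine-tune it on your own labels):

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForTokenClassification

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=5  # placeholder label count
)

image = Image.open("page_1.png").convert("RGB")    # one rendered PDF page
encoding = processor(image, return_tensors="pt")   # Tesseract OCR -> words + bounding boxes
outputs = model(**encoding)                        # token-level logits over layout labels
```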
## emerging trends
One emerging approach is using large vision-language models (VLMs) to analyze documents: for instance, NVIDIA research in 2025 compared a pipeline of specialized models vs. a single multimodal GPT-style model for PDF extraction. The finding was that a dedicated multi-model pipeline (detecting tables, charts, etc. with specialized OCR modules) still gave better accuracy and efficiency than a generalist VLM describing the content. However, the gap may be closing: OpenAI’s GPT-4 with vision, for example, can interpret images of a document quite well.
## considerations for use with LLMs
If parsing the PDFs is part of a workflow leveraging LLMs for RAG or QA, your PDF parser should preserve semantic structure and layout to support both:
- **retrieval-based chunking** (for context windows)
- **reasoning across document structure** (e.g., figures, tables, sections)
Consider how each feature will support the LLM.
| Parser Feature | Why It Matters for LLM |
| -------------------------------------------------------------- | ------------------------------------------- |
| **Logical segmentation** (headings, paragraphs, tables) | Allows for semantically coherent chunks |
| **Reading order preservation** | Prevents disjointed or scrambled input |
| **Table handling (HTML or Markdown)** | Lets the LLM “see” structured data properly |
| **Chunk-friendly output** (e.g., list of sections or elements) | Makes embedding + retrieval modular |
Use logical block-based chunking (e.g., with `unstructured` or another layout-aware and table-aware parser).
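A minimal sketch of that chunking, reusing `elements` from the `partition_pdf` call shown earlier:

```python
from unstructured.chunking.title import chunk_by_title

# Group elements into section-level chunks bounded by headings
chunks = chunk_by_title(elements, max_characters=1000)
for chunk in chunks[:3]:
    print(type(chunk).__name__, chunk.text[:80])
```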
New tools like [LangExtract](https://github.com/google/langextract/) are also simplifying the pipeline for tasks like [[named entity recognition]] over PDFs without converting to text first.
## implementations
### PyMuPDF
PyMuPDF is fast, but its plain-text extraction does not attempt to parse tables, which could hurt performance in an LLM pipeline.
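A minimal sketch (`report.pdf` is a placeholder path):

```python
import fitz  # PyMuPDF

# Concatenate plain text page by page; layout and tables are not preserved
with fitz.open("report.pdf") as doc:
    text = "\n".join(page.get_text() for page in doc)
```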
### Unstructured
[Unstructured integration documentation](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file/).
Add the `unstructured` library to your `uv` project.
```bash
uv add "unstructured[pdf]"
```
Install system dependencies [poppler-utils](https://poppler.freedesktop.org/) and [tesseract-ocr](https://github.com/tesseract-ocr/tesseract) (for images and PDFs).
```bash
sudo apt update
sudo apt install poppler-utils
sudo apt install tesseract-ocr
```
Why Use? Unstructured retrieves a pretty good structured representation but isn't best in class. The structured output would allow separate strategies for text and tables, but the open source version doesn't consistently identify tables (e.g., sees left half of acronym list as table but not right half). It has an integration with LangChain that makes it an easy choice for a streamlined workflow. On the downside, advanced functionality with their API costs $500/mo.
### Mistral
Get an API key by signing up for the pay-as-you-go service.
Add `mistralai` and `python-dotenv` to your [[uv]] project.
```bash
uv add mistralai python-dotenv
```
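A minimal sketch of the OCR endpoint (assumes `MISTRAL_API_KEY` is set in a `.env` file; the document URL is a placeholder):

```python
import os

from dotenv import load_dotenv
from mistralai import Mistral

load_dotenv()
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# OCR a hosted PDF; each page comes back as Markdown
ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url", "document_url": "https://arxiv.org/pdf/2201.04234"},
    include_image_base64=True,  # return extracted images as base64
)

markdown = "\n\n".join(page.markdown for page in ocr_response.pages)
```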
Images can also be saved out to an output folder and rendered with transclusion or embedded directly as base64.
Why Use? Mistral provides the highest fidelity Markdown representation. Like MinerU, images can be saved and reasoned over by a VLM.
### MinerU
Add MinerU to your uv project.
```bash
uv add "minerU[core]"
```
Run from the command line.
```bash
uv run mineru -p <path/to/file.pdf> -o <output_dir>
```
For a Python implementation, see the [demo](https://github.com/opendatalab/MinerU/blob/master/demo/demo.py); for advanced functionality, see this [walkthrough](https://stable-learn.com/en/mineru-tutorial/).
Why Use? MinerU extracts not just text but also layout and a screen capture of each visual element (text box, table, or chart). Each capture can be summarized to produce an embedding and retrieved at inference time, allowing a VLM (e.g., GPT-4) to reason over the image itself rather than a potentially faulty Markdown representation. The downside is a higher upfront cost for processing and storage.
### pdfplumber
Great for semi-manual extraction of data from tables with the visual debugging tool. Not generalizable enough for an unsupervised pipeline.
### LangChain
[[LangChain]] has a variety of document loader integrations, including PyMuPDF and Unstructured.
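For example, a minimal sketch of the PyMuPDF loader (assumes the `langchain-community` package; `report.pdf` is a placeholder path):

```python
from langchain_community.document_loaders import PyMuPDFLoader

# One Document per page, with page-level metadata
docs = PyMuPDFLoader("report.pdf").load()
```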
A basic text splitter in LangChain:
```python
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
```
### LlamaIndex
[[LlamaIndex]] also has a suite of [document loader integrations](https://docs.llamaindex.ai/en/stable/understanding/loading/loading/), including pyMuPDF and Unstructured.
See [RAG pipeline tutorial](https://docs.llamaindex.ai/en/stable/understanding/rag/).
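A minimal sketch of the core reader and index (assumes the `llama-index` package and an OpenAI API key for the default embeddings and LLM; `data/` is a placeholder folder of PDFs):

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load every document in the folder (PDFs are parsed automatically)
documents = SimpleDirectoryReader("data").load_data()

# Build a vector index and ask a question over the parsed content
index = VectorStoreIndex.from_documents(documents)
response = index.as_query_engine().query("What tables appear in the report?")
print(response)
```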