Docling AI: A Complete Guide to Parsing


What is Docling?

Docling is an open-source framework created by IBM to convert unstructured documents into structured machine-readable formats. Instead of giving a raw text dump like most PDF tools, Docling analyzes the layout and turns each page into a structured hierarchy.

When Docling processes a document, it performs several steps:

1. Layout understanding

The PDF is decomposed into different blocks, like titles, paragraphs, headings, tables, figures, and footnotes.

2. Semantic grouping

These blocks are organized into logical sections: Introduction, Methods, Results, References, etc.

3. Content extraction

Text is extracted in natural reading order, tables are reconstructed, and figures are exported.

4. Structured exporting

The extracted data can be exported into Markdown, HTML, JSON, or image files.

Instead of losing structure, Docling preserves the shape of the document so it can be read by LLMs, analyzed by downstream applications, or used in retrieval systems.

Docling works especially well on:

  • Academic papers
  • Scientific documents
  • Multi-column PDFs
  • Documents with equations
  • Tables and charts
  • Scanned documents (with OCR installed)

Now that we know what Docling does, we can begin setting it up.

Note: In this tutorial, we’ll use a technical paper to build the project, but you can use any PDF of your choice.

Setting up the Docling environment

Before we begin, make sure you have installed Python 3.9 or later. We recommend using a clean virtual environment, so our dependencies stay isolated.

Open your terminal and create a virtual environment:

python3 -m venv Docling_env

Let’s activate the environment we just created. Here’s how you can do it on Windows:

Docling_env\Scripts\activate

Here’s how you can do it on macOS/Linux:

source Docling_env/bin/activate

Once the environment is active, install Docling:

pip install docling

If your PDF contains scanned pages, then you’ll have to install Tesseract OCR to process them:

Here’s how you can install it on Windows:

Install from the official Tesseract GitHub releases. When installation completes, Docling will be ready to process the scanned images for you.

Here’s how you can install it in macOS:

brew install tesseract

Here’s how you can install it on Linux:

sudo apt-get install tesseract-ocr
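Once Tesseract is installed, Docling can be told to use it for scanned pages through its pipeline options. Here is a minimal, hedged sketch; the exact class names (such as TesseractCliOcrOptions) follow recent Docling releases and may differ in yours, and Docling also ships other OCR backends:

```python
# Sketch: routing OCR through the Tesseract CLI we just installed.
# Option and class names may vary across Docling versions.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True                           # run OCR on scanned pages
pipeline_options.ocr_options = TesseractCliOcrOptions()  # use the tesseract binary

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```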

Understanding Docling’s sample PDF

To demonstrate how Docling handles real-world documents, we use the official sample PDF provided by the Docling team, which is a Docling technical report.

This document is ideal for us because it includes:

  • A title and authors
  • Multiple sections with headings
  • Equations
  • Multi-column layout
  • Embedded tables
  • References
  • Figures

These characteristics will now allow us to test Docling’s ability to:

  • Reconstruct layout
  • Parse scientific elements
  • Extract structured content
  • Preserve reading order
  • Handle multi-column text correctly

Place the PDF file in your project directory and rename it for simplicity:

Docling_sample.pdf

Now we can begin parsing.

How to load and parse the PDF using Docling

Let’s load the PDF into Docling and run its parsing pipeline. Create a new Python file:

parse_doc.py

And, add the following code in the parse_doc.py file:

from docling.document_converter import DocumentConverter

source = "Docling_sample.pdf"

converter = DocumentConverter()
result = converter.convert(source)

# Export to Markdown
markdown_content = result.document.export_to_markdown()

# Save to file
output_file = "output.md"
with open(output_file, "w") as f:
    f.write(markdown_content)

print(f"Successfully converted {source} to {output_file}")

What is happening here:

  • DocumentConverter() creates a high-level pipeline. Instead of making you load, parse, and export separately, Docling provides DocumentConverter as an all-in-one utility. It bundles:

    • Loading the PDF.
    • Running the parser.
    • Performing layout analysis.
    • Generating a unified document object.
  • converter.convert(source) runs the full conversion pipeline. This single call:

    • Reads the file Docling_sample.pdf.
    • Analyzes text blocks, headings, tables, figures, math, etc.
    • Returns a ConversionResult object.

The result contains result.document, which is the fully structured Docling document.

  • result.document.export_to_markdown() converts the internal structured representation into clean Markdown. Because the conversion pipeline has already run, generating the output takes just this one call.

  • The Markdown is saved to output.md: a normal Python open() call creates the file, and f.write(markdown_content) writes the converted Markdown text into it.

Once you run the script, Docling outputs a structured Markdown version of the scientific paper. Here’s what the Markdown file will look like after the parsing process:

Converting PDF to Markdown using Docling AI

You should see:

  • The paper title
  • Authors
  • Abstract
  • Section headings
  • Paragraphs of text

This confirms that Docling has correctly understood the layout.
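One quick way to verify this programmatically is to count the headings in the exported Markdown. A small sketch, where sample_md is a made-up stand-in for the contents of output.md:

```python
# Count Markdown headings in the exported text.
# sample_md is a hypothetical stand-in for what output.md contains.
sample_md = """# Docling Technical Report
## Abstract
Docling converts PDFs into structured documents.
## 1 Introduction
Parsing PDFs is hard.
"""

headings = [line for line in sample_md.splitlines() if line.startswith("#")]
print(f"Found {len(headings)} headings")  # prints: Found 3 headings
```

In practice you would read output.md from disk instead of using an inline sample.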

Understanding Docling’s output structure

Docling represents the parsed result as a structured tree. Let’s inspect the parsed document more directly. Add this snippet to your script:

print(result.document.export_to_dict().keys())

This prints the document’s top-level structural elements, such as:

  • texts
  • tables
  • pictures
  • groups
  • pages

To inspect the actual structure, add the following code snippet in your code:

import json

output_json_file = "Docling_structure.json"
with open(output_json_file, "w") as f:
    json.dump(result.document.export_to_dict(), f, indent=2)

print(f"Successfully exported structure to {output_json_file}")

When run, this will produce a large hierarchical JSON structure that mirrors the layout of the PDF. You will see entries like:

Docling AI parsing JSON output

Unlike simple PDF extraction libraries, the content is grouped semantically. This is especially useful when processing research papers for:

  • LLM summarization
  • Knowledge extraction
  • Citation mapping
  • Scientific data indexing
  • Dataset creation

With this, we now know we can extract specific elements from PDFs.

How to extract text content using Docling

Let’s now extract the title and headings. To do this, create a file named:

extract_text.py

And paste the following code in the extract_text.py file:

from docling.document_converter import DocumentConverter
from docling_core.types.doc.labels import DocItemLabel
import logging

# Keep the output clean
logging.basicConfig(level=logging.ERROR)

source = "Docling_sample.pdf"
converter = DocumentConverter()
result = converter.convert(source)
doc = result.document

title_text = ""
headings = []
paragraphs = []

# Iterate over items to categorize them
# We assume the first section header is the title
found_title = False
for item, level in doc.iterate_items():
    label = getattr(item, "label", None)
    text = getattr(item, "text", "").strip()
    if not text:
        continue
    if label == DocItemLabel.SECTION_HEADER:
        # Try to get the heading level from the item, default to 1
        h_level = getattr(item, "level", 1)
        headings.append(f"{'#' * h_level} {text}")
        if not found_title:
            title_text = text
            found_title = True
    elif label == DocItemLabel.TEXT:
        paragraphs.append(text)

# If no title was found from the headers, use the document name
if not title_text:
    title_text = doc.name

# Write the results to files
with open("title.md", "w") as f:
    f.write(title_text + "\n")
with open("headings.md", "w") as f:
    f.write("\n\n".join(headings) + "\n")
with open("paragraphs.md", "w") as f:
    f.write("\n\n".join(paragraphs) + "\n")

print("Extraction complete.")
print(f"Title saved to title.md ({len(title_text)} chars)")
print(f"Headings saved to headings.md ({len(headings)} items)")
print(f"Paragraphs saved to paragraphs.md ({len(paragraphs)} items)")

This script takes a PDF and breaks it into clean, structured parts that you can use in your project. Using Docling’s DocumentConverter, it first processes the entire PDF and turns it into a structured document. Then, as it walks through each detected element, it separates the content into three categories: the main title (taken from the first heading), all remaining headings (converted into proper Markdown # levels), and all body paragraphs.

If the PDF doesn’t contain a clear title, the script automatically uses the file name instead. Once everything is sorted, it saves the title, headings, and paragraphs into three separate Markdown files. This gives you a neatly organized set of outputs you can use for further processing, display, or analysis.


Here’s what the headings file looks like:

Docling heading extraction from pdf

Here’s what the paragraphs file looks like:

Docling paragraph extraction from pdf

How to extract tables from documents using Docling

Scientific tables in research papers often have:

  • multi-row headers
  • merged cells
  • superscripts

Processing these kinds of tables can be a hassle, but Docling can handle them for us. The Docling paper we are using as the source contains several tables summarizing document conversion metrics, and we can extract them as structured CSV files. Create a new file called:

extract_tables.py

Add the following code inside the extract_tables.py file:

from docling.document_converter import DocumentConverter
import logging

# Keep the output clean
logging.basicConfig(level=logging.ERROR)

source = "Docling_sample.pdf"
converter = DocumentConverter()
result = converter.convert(source)
doc = result.document

table_count = 0
for item, level in doc.iterate_items():
    label = getattr(item, "label", None)
    if label == "table":
        table_count += 1
        # Check if the item supports DataFrame export
        if hasattr(item, "export_to_dataframe"):
            df = item.export_to_dataframe()
            filename = f"table_{table_count}.csv"
            df.to_csv(filename, index=False)
            print(f"Saved {filename}")
        else:
            print(f"Found table item but it lacks export_to_dataframe method: {type(item)}")

if table_count == 0:
    print("No tables found in the document.")

We are extracting tables from a PDF using Docling’s parsing pipeline here. After converting the PDF into a structured document, it walks through all detected items and looks specifically for elements labeled as tables. Every time it finds one, it checks whether the table supports conversion to a pandas DataFrame. If it does, the script exports that table and saves it as a CSV file named sequentially (table_1.csv, table_2.csv, and so on).

If a table-like item doesn’t support DataFrame export, the script reports it, so we know something needs manual handling. Once the scan is complete, it either confirms that all tables were saved or informs you that no tables were found in the PDF.

Running the code will create the following tables:

table_1.csv
table_2.csv
table_3.csv

Each of these CSV files will contain reconstructed rows and columns, like:

Using Docling AI to extract tables from PDF
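Once exported, the CSVs can be read back with standard tools for downstream analysis. Here is a sketch using the stdlib csv module; the column names and rows below are invented stand-ins for a real table_1.csv:

```python
import csv
import io

# Hypothetical stand-in for one of the exported files (table_1.csv, etc.).
# The column names and numbers here are made up for illustration.
sample_csv = """Parser,Pages/s,Accuracy
docling,2.5,0.97
baseline,4.0,0.81
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
parsers = [row["Parser"] for row in rows]
print(parsers)               # one entry per data row
print(rows[0]["Accuracy"])   # cell values come back as strings
```

In practice you would pass an open file handle for table_1.csv to csv.DictReader instead of the inline sample.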

How to extract images and figures using Docling

Scientific and research papers commonly include figures. We can use Docling to extract these figures and images from the documents. Create a file named:

extract_images.py

And paste the following code snippet in the extract_images.py file:

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
import logging

# Keep the output clean
logging.basicConfig(level=logging.ERROR)

source = "Docling_sample.pdf"

# Configure the pipeline to generate images
pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

print("Converting document (this may take a moment as images are being generated)...")
result = converter.convert(source)
doc = result.document

image_count = 0
for item, level in doc.iterate_items():
    label = getattr(item, "label", None)
    if label == "picture":
        image_count += 1
        # Try to get the image data
        if hasattr(item, "get_image"):
            image = item.get_image(doc)
            if image:
                filename = f"figure_{image_count}.png"
                image.save(filename)
                print(f"Saved {filename}")
            else:
                print(f"Found picture item {image_count} but could not extract image data.")
        else:
            print(f"Found picture item {image_count} but it lacks get_image method.")

if image_count == 0:
    print("No images found in the document.")

Here we have written the code to extract images from our PDF using Docling’s image-enabled processing pipeline. First, it configures the converter to generate picture images while parsing by enabling generate_picture_images inside PdfPipelineOptions. Once the PDF is converted, the script walks through all structured items in the document and looks for elements labeled as pictures. For every detected picture, it attempts to retrieve the actual image data using get_image() and saves each one as a PNG file named sequentially (figure_1.png, figure_2.png, etc.).

If an item is marked as a picture but doesn’t expose image data, the script logs a helpful message, so you know something needs manual review. After scanning the entire document, it either confirms that images were successfully exported or reports that no images were found.

This PDF includes diagrams and charts that will appear as PNG files in your project, like:

figure_1.png
figure_2.png

Extracting images from documents using Docling AI

You can now use these images for OCR, embedding, or dataset creation.

Best practices for reliable document extraction using Docling

Parsing scientific PDFs can be challenging. Here are practices that help achieve consistent results.

Choose high-quality PDFs

Scientific PDFs vary widely. Born-digital PDFs (generated directly from LaTeX) parse extremely well. Scanned PDFs depend heavily on OCR quality.

Understand multi-column layouts

Many research papers use two columns. Docling is trained to reconstruct reading order correctly, but unusual layouts may require validation.

Check table boundaries

Complex tables often include:

  • Spanning headers
  • Nested tables
  • Footnotes

These can lose structure during extraction, so verify the exported CSVs against the original PDF.

Prefer structured exports for LLMs

If you plan to build a RAG or summarization system:

  • Use JSON
  • Chunk by section
  • Attach figure/table captions
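Chunking by section can be sketched with plain Python over the (label, text) pairs that the extraction scripts above already produce. The sample items here are hypothetical:

```python
# Minimal section-chunking sketch over (label, text) pairs.
# The items below are made up; in practice they come from doc.iterate_items().
items = [
    ("section_header", "Introduction"),
    ("text", "Docling converts PDFs into structured documents."),
    ("text", "It preserves layout and reading order."),
    ("section_header", "Methods"),
    ("text", "The pipeline combines layout analysis with table parsing."),
]

chunks = []
current = None
for label, text in items:
    if label == "section_header":
        # Start a new chunk at every heading
        current = {"heading": text, "body": []}
        chunks.append(current)
    elif current is not None:
        current["body"].append(text)

for chunk in chunks:
    print(chunk["heading"], "->", len(chunk["body"]), "paragraphs")
```

Each chunk keeps its heading attached, so a retrieval system can show users which section an answer came from.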

Cache parsed results

Parsing large PDFs is slow. Save JSON outputs and reuse them.
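A simple cache can key the parsed JSON on a hash of the PDF bytes, so re-running a pipeline skips documents that have not changed. Here is a sketch; parse_fn stands in for any wrapper around Docling’s export pipeline, and the stub below replaces a real Docling call:

```python
import hashlib
import json
from pathlib import Path

def cached_parse(pdf_path, parse_fn, cache_dir="parse_cache"):
    """Return cached JSON output if this exact PDF was parsed before."""
    key = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = parse_fn(pdf_path)  # e.g. a wrapper around export_to_dict()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result))
    return result

# Demo with a stub parser and a throwaway file
calls = []
def fake_parse(path):
    calls.append(path)
    return {"name": path}

Path("demo.pdf").write_bytes(b"%PDF-1.4 demo")
first = cached_parse("demo.pdf", fake_parse)
second = cached_parse("demo.pdf", fake_parse)
print(len(calls))        # the parser ran only once
print(first == second)   # both calls return the same data
```

Hashing the bytes (rather than the filename) means a renamed copy still hits the cache, while an edited PDF is re-parsed.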

Conclusion

You have now seen how Docling AI can transform complex PDFs, such as the Docling technical paper, into clean, structured text, tables, images, and metadata. With experience in installation, parsing, exporting, and handling tables and images, you’re well-prepared to start building your own automation or analysis workflows.

Continue to explore Docling with various document types, from research papers to financial reports, and discover the full extent of its capabilities.

To deepen your knowledge, explore courses on LLMs and retrieval-augmented generation (RAG).

Frequently asked questions

1. What is Docling used for?

Docling is used to extract structured data—such as text, tables, images, layout elements, and equations—from PDFs, research papers, Word files, and scanned documents. It turns unstructured files into machine-readable formats for AI and automation workflows.

2. Is Docling free?

Yes, Docling is completely free and open-source. You can use, modify, and integrate it into your projects without any licensing restrictions or costs.

3. Does Docling use a GPU?

Yes, Docling can utilize a GPU to accelerate model-based tasks, such as layout detection. It also runs on CPU if a GPU is not available, making it flexible for different environments.

4. What is Docling AI and how does it work?

Docling AI is a document-understanding framework that converts PDFs, images, and complex layouts into structured, machine-readable data. It works by combining vision models and text-extraction pipelines to analyze layout, text blocks, tables, and metadata.

5. Which document format does Docling AI support?

Docling supports PDFs, scanned PDFs, images (PNG, JPEG), Word documents, and multi-page documents. It can extract text, tables, layout structure, and metadata across these formats.

Codecademy Team

The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.
