Docling AI: A Complete Guide to Parsing
What is Docling?
Docling is an open-source framework created by IBM to convert unstructured documents into structured machine-readable formats. Instead of giving a raw text dump like most PDF tools, Docling analyzes the layout and turns each page into a structured hierarchy.
When Docling processes a document, it performs several steps:
1. Layout understanding
The PDF is decomposed into different blocks, like titles, paragraphs, headings, tables, figures, and footnotes.
2. Semantic grouping
These blocks are organized into logical sections: Introduction, Methods, Results, References, etc.
3. Content extraction
Text is extracted in natural reading order, tables are reconstructed, and figures are exported.
4. Structured exporting
The extracted data can be exported into Markdown, HTML, JSON, or image files.
Instead of losing structure, Docling preserves the shape of the document so it can be read by LLMs, analyzed by downstream applications, or used in retrieval systems.
Docling works especially well on:
- Academic papers
- Scientific documents
- Multi-column PDFs
- Documents with equations
- Tables and charts
- Scanned documents (with OCR installed)
Now that we know what Docling does, we can begin setting it up.
Note: In this tutorial, we’ll use a technical paper to build the project, but you can use any PDF of your choice.
Setting up the Docling environment
Before we begin, make sure you have installed Python 3.9 or later. We recommend using a clean virtual environment, so our dependencies stay isolated.
Open your terminal and create a virtual environment:
python3 -m venv Docling_env
Let’s activate the environment we just created. Here’s how you can do it on Windows:
Docling_env\Scripts\activate
Here’s how you can do it on macOS/Linux:
source Docling_env/bin/activate
Once the environment is active, install Docling:
pip install docling
If your PDF contains scanned pages, then you’ll have to install Tesseract OCR to process them:
Here’s how you can install it on Windows:
Install from the official Tesseract GitHub releases. When installation completes, Docling will be ready to process the scanned images for you.
Here’s how you can install it on macOS:
brew install tesseract
Here’s how you can install it on Linux (Debian/Ubuntu):
sudo apt-get install tesseract-ocr
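Once Tesseract is installed, OCR can be switched on through Docling’s pipeline options. The configuration sketch below assumes Docling’s `PdfPipelineOptions` and `TesseractOcrOptions` classes; check the API of your installed version, as option names have varied between releases:

```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions

# Enable OCR so scanned pages are recognized rather than skipped
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractOcrOptions()  # assumes Tesseract is on your PATH

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```

With this converter in place, scanned PDFs go through the same `convert()` call used later in this tutorial.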
Understanding Docling’s sample PDF
To demonstrate how Docling handles real-world documents, we use the official sample PDF provided by the Docling team, which is a Docling technical report.
This document is ideal for us because it includes:
- A title and authors
- Multiple sections with headings
- Equations
- Multi-column layout
- Embedded tables
- References
- Figures
These characteristics let us test Docling’s ability to:
- Reconstruct layout
- Parse scientific elements
- Extract structured content
- Preserve reading order
- Handle multi-column text correctly
Place the PDF file in your project directory and rename it for simplicity:
Docling_sample.pdf
Now we can begin parsing.
How to load and parse the PDF using Docling
Let’s load the PDF into Docling and run its parsing pipeline. Create a new Python file:
parse_doc.py
Then add the following code to the parse_doc.py file:
```python
from docling.document_converter import DocumentConverter

source = "Docling_sample.pdf"
converter = DocumentConverter()
result = converter.convert(source)

# Export to Markdown
markdown_content = result.document.export_to_markdown()

# Save to file
output_file = "output.md"
with open(output_file, "w") as f:
    f.write(markdown_content)

print(f"Successfully converted {source} to {output_file}")
```
What is happening here:

- `DocumentConverter()` creates a high-level, all-in-one pipeline. Instead of separately loading, parsing, and exporting, it bundles:
  - Loading the PDF.
  - Running the parser.
  - Performing layout analysis.
  - Generating a unified document object.
- `converter.convert(source)` runs the full conversion pipeline. This single call reads the file `Docling_sample.pdf`, analyzes text blocks, headings, tables, figures, math, etc., and returns a `ConversionResult` object.
- The result contains `result.document`, which is the fully structured Docling document.
- `result.document.export_to_markdown()` converts the internal structured representation into clean Markdown.
- The Markdown is saved to `output.md`: a normal Python `open()` call creates the file, and `f.write(markdown_content)` writes the converted Markdown text into it.
Once you run the script, Docling outputs a structured Markdown version of the scientific paper in output.md.
You should see:
- The paper title
- Authors
- Abstract
- Section headings
- Paragraphs of text
This confirms that Docling has correctly understood the layout.
Understanding Docling’s output structure
Docling represents the parsed result as a structured tree. Let’s inspect the parsed document more directly. Add this snippet to your script:
print(result.document.export_to_dict().keys())
This prints top-level structural elements, such as:
`sections`, `headings`, `tables`, `figures`, `pages`
To inspect the actual structure, add the following code snippet in your code:
```python
import json

output_json_file = "Docling_structure.json"
with open(output_json_file, "w") as f:
    json.dump(result.document.export_to_dict(), f, indent=2)

print(f"Successfully exported structure to {output_json_file}")
```
When run, this produces a large hierarchical JSON structure that mirrors the layout of the PDF, with entries for sections, headings, tables, and figures.
Unlike the output of simple PDF extraction libraries, this content is grouped semantically. This is especially useful when processing research papers for:
- LLM summarization
- Knowledge extraction
- Citation mapping
- Scientific data indexing
- Dataset creation
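To illustrate the “chunk by section” idea used in retrieval pipelines, here is a minimal sketch that splits Docling’s Markdown export into heading-plus-body chunks. The function name and chunk shape are our own for illustration, not part of Docling’s API:

```python
def chunk_markdown_by_section(markdown: str) -> list:
    """Split Markdown into {heading, body} chunks for retrieval or summarization."""
    chunks = []
    current = {"heading": "", "body": []}
    for line in markdown.splitlines():
        if line.startswith("#"):
            # A new heading closes the previous chunk (if it has any content)
            if current["heading"] or current["body"]:
                chunks.append({"heading": current["heading"],
                               "body": "\n".join(current["body"]).strip()})
            current = {"heading": line.lstrip("#").strip(), "body": []}
        else:
            current["body"].append(line)
    if current["heading"] or current["body"]:
        chunks.append({"heading": current["heading"],
                       "body": "\n".join(current["body"]).strip()})
    return chunks

sample = "# Title\nIntro text.\n## Methods\nWe did things."
for chunk in chunk_markdown_by_section(sample):
    print(chunk["heading"], "->", chunk["body"])
```

Each chunk keeps its section heading attached, which makes the text far easier to index and retrieve than an undifferentiated text dump.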
With this, we now know we can extract specific elements from PDFs.
How to extract text content using Docling
Let’s now extract the title and headings. To do this, create a file named:
extract_text.py
And paste the following code in the extract_text.py file:
```python
from docling.document_converter import DocumentConverter
import logging

# Disable logging
logging.basicConfig(level=logging.ERROR)

source = "Docling_sample.pdf"
converter = DocumentConverter()
result = converter.convert(source)
doc = result.document

title_text = ""
headings = []
paragraphs = []

# Iterate over items to categorize them
# We assume the first section_header is the title
found_title = False
for item, level in doc.iterate_items():
    label = getattr(item, "label", None)
    text = getattr(item, "text", "").strip()
    if not text:
        continue
    if label == "section_header":
        # Try to get the heading level from the item, default to 1
        h_level = getattr(item, "level", 1)
        headings.append(f"{'#' * h_level} {text}")
        if not found_title:
            title_text = text
            found_title = True
    elif label == "text":
        paragraphs.append(text)

# If no title was found from the headers, use the document name
if not title_text:
    title_text = doc.name

# Write to files
with open("title.md", "w") as f:
    f.write(title_text + "\n")
with open("headings.md", "w") as f:
    f.write("\n\n".join(headings) + "\n")
with open("paragraphs.md", "w") as f:
    f.write("\n\n".join(paragraphs) + "\n")

print("Extraction complete.")
print(f"Title saved to title.md ({len(title_text)} chars)")
print(f"Headings saved to headings.md ({len(headings)} items)")
print(f"Paragraphs saved to paragraphs.md ({len(paragraphs)} items)")
```
This script takes our PDF and breaks it into clean, structured parts that we can feed to LLMs. Using Docling’s `DocumentConverter`, it first processes the entire PDF into a structured document. Then, as it walks through each detected element, it separates the content into three categories:

- The main title (taken from the first heading)
- All remaining headings (converted into proper Markdown `#` levels)
- All body paragraphs

If the PDF doesn’t contain a clear title, the script automatically falls back to the file name. Once everything is sorted, it saves the title, headings, and paragraphs into three separate Markdown files. This gives you a neatly organized set of outputs you can use for further processing, display, or analysis.
How to extract tables from documents using Docling
Research papers contain scientific tables that often have:
- multi-row headers
- merged cells
- superscripts
Processing these kinds of tables can be a hassle, but Docling can handle them for us. The Docling paper we are using as the source contains several tables summarizing document conversion metrics, and we can extract them as structured CSV files. Create a new file called:
extract_tables.py
Add the following code inside the extract_tables.py file:
```python
from docling.document_converter import DocumentConverter
import logging

# Disable logging
logging.basicConfig(level=logging.ERROR)

source = "Docling_sample.pdf"
converter = DocumentConverter()
result = converter.convert(source)
doc = result.document

table_count = 0
for item, level in doc.iterate_items():
    label = getattr(item, "label", None)
    if label == "table":
        table_count += 1
        # Check if the item has an export_to_dataframe method
        if hasattr(item, "export_to_dataframe"):
            df = item.export_to_dataframe()
            filename = f"table_{table_count}.csv"
            df.to_csv(filename, index=False)
            print(f"Saved {filename}")
        else:
            print(f"Found table item but it lacks export_to_dataframe method: {type(item)}")

if table_count == 0:
    print("No tables found in the document.")
```
We are extracting tables from a PDF using Docling’s parsing pipeline here. After converting the PDF into a structured document, it walks through all detected items and looks specifically for elements labeled as tables. Every time it finds one, it checks whether the table supports conversion to a pandas DataFrame. If it does, the script exports that table and saves it as a CSV file named sequentially (table_1.csv, table_2.csv, and so on).
If a table-like item doesn’t support DataFrame export, the script reports it, so we know something needs manual handling. Once the scan is complete, it either confirms that all tables were saved or informs you that no tables were found in the PDF.
Running the code will create the following tables:
table_1.csv
table_2.csv
table_3.csv
Each of these CSV files will contain the reconstructed rows and columns of the corresponding table.
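To sanity-check the exported tables without opening each file by hand, a small helper like the following can preview the first rows of each CSV. This is our own utility built on Python’s standard library, not part of Docling:

```python
import csv
import os

def preview_csv(path: str, max_rows: int = 3) -> list:
    """Return up to max_rows rows of an extracted table, or [] if the file is missing."""
    if not os.path.exists(path):
        return []
    with open(path, newline="") as f:
        return [row for _, row in zip(range(max_rows), csv.reader(f))]

for name in ["table_1.csv", "table_2.csv", "table_3.csv"]:
    rows = preview_csv(name)
    print(name, "->", rows if rows else "(not found)")
```

Comparing these previews against the tables in the original PDF is a quick way to catch extraction problems early.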
How to extract images and figures using Docling
Scientific and research papers commonly include figures. We can use Docling to extract these figures and images from the documents. Create a file named:
extract_images.py
And paste the following code snippet in the extract_images.py file:
```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
import logging

# Disable logging
logging.basicConfig(level=logging.ERROR)

source = "Docling_sample.pdf"

# Configure the pipeline to generate images
pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

print("Converting document (this may take a moment as images are being generated)...")
result = converter.convert(source)
doc = result.document

image_count = 0
for item, level in doc.iterate_items():
    label = getattr(item, "label", None)
    if label == "picture":
        image_count += 1
        # Try to get the image data
        if hasattr(item, "get_image"):
            image = item.get_image(doc)
            if image:
                filename = f"figure_{image_count}.png"
                image.save(filename)
                print(f"Saved {filename}")
            else:
                print(f"Found picture item {image_count} but could not extract image data.")
        else:
            print(f"Found picture item {image_count} but it lacks get_image method.")

if image_count == 0:
    print("No images found in the document.")
```
Here we have written the code to extract images from our PDF using Docling’s image-enabled processing pipeline. First, it configures the converter to generate picture images while parsing by enabling generate_picture_images inside PdfPipelineOptions. Once the PDF is converted, the script walks through all structured items in the document and looks for elements labeled as pictures. For every detected picture, it attempts to retrieve the actual image data using get_image() and saves each one as a PNG file named sequentially (figure_1.png, figure_2.png, etc.).
If an item is marked as a picture but doesn’t expose image data, the script logs a helpful message, so you know something needs manual review. After scanning the entire document, it either confirms that images were successfully exported or reports that no images were found.
This PDF includes diagrams and charts that will appear as PNG files in your project, like:
figure_1.png
figure_2.png
You can now use these images for OCR, embedding, or dataset creation.
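For example, to send an extracted figure to a multimodal LLM API, the PNG file can be encoded as a base64 data URL. This is a generic standard-library helper, independent of Docling:

```python
import base64

def image_to_data_url(path: str) -> str:
    """Encode an extracted figure as a base64 data URL for a vision-model API."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:image/png;base64,{encoded}"

# Usage (assumes figure_1.png was produced by the extraction script above):
# payload = image_to_data_url("figure_1.png")
```

Most vision-capable APIs accept images in exactly this data-URL form.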
Best practices for reliable document extraction using Docling
Parsing scientific PDFs can be challenging. Here are practices that help achieve consistent results.
Choose high-quality PDFs
Scientific PDFs vary widely. Born-digital PDFs (generated directly from LaTeX) parse extremely well. Scanned PDFs depend heavily on OCR quality.
Understand multi-column layouts
Many research papers use two columns. Docling is trained to reconstruct reading order correctly, but unusual layouts may require validation.
Check table boundaries
Complex tables with:
- Spanning headers
- Nested tables
- Footnotes
may not always reconstruct perfectly, so verify the exported CSVs against the original PDF.
Prefer structured exports for LLMs
If you plan to build a RAG or summarization system:
- Use JSON
- Chunk by section
- Attach figure/table captions
Cache parsed results
Parsing large PDFs is slow. Save JSON outputs and reuse them.
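A minimal caching wrapper might look like this. It is a sketch: `convert_fn` stands in for whatever conversion call you use, e.g. a lambda around `converter.convert(path).document.export_to_dict()`:

```python
import json
import os

def load_or_convert(pdf_path: str, cache_path: str, convert_fn):
    """Reuse a cached JSON export if present; otherwise convert and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    data = convert_fn(pdf_path)  # the expensive parse runs only on a cache miss
    with open(cache_path, "w") as f:
        json.dump(data, f, indent=2)
    return data
```

Subsequent runs of your pipeline then read the cached JSON instantly instead of re-parsing the PDF.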
Conclusion
You have now seen how Docling can transform complex PDFs, such as the Docling technical report, into clean, structured text, tables, images, and metadata. With experience in installation, parsing, exporting, and handling tables and images, you’re well prepared to start building your own automation or analysis workflows.
Continue to explore Docling with various document types, from research papers to financial reports, and discover the full extent of its capabilities.
Frequently asked questions
1. What is Docling used for?
Docling is used to extract structured data, such as text, tables, images, layout elements, and equations, from PDFs, research papers, Word files, and scanned documents. It turns unstructured files into machine-readable formats for AI and automation workflows.
2. Is Docling free?
Yes, Docling is completely free and open-source. You can use, modify, and integrate it into your projects without any licensing restrictions or costs.
3. Does Docling use a GPU?
Yes, Docling can utilize a GPU to accelerate model-based tasks, such as layout detection. It also runs on CPU if a GPU is not available, making it flexible for different environments.
4. What is Docling AI and how does it work?
Docling AI is a document-understanding framework that converts PDFs, images, and complex layouts into structured, machine-readable data. It works by combining vision models and text-extraction pipelines to analyze layout, text blocks, tables, and metadata.
5. Which document format does Docling AI support?
Docling supports PDFs, scanned PDFs, images (PNG, JPEG), Word documents, and multi-page documents. It can extract text, tables, layout structure, and metadata across these formats.