Building Visual RAG Pipelines with Llama 3.2 Vision & Ollama

LLaMA 3.2 Vision is Meta’s vision-language model that processes text and images in a single input, making it a multimodal large language model. It overcomes a key limitation of traditional RAG by supporting image-based retrieval alongside text. With the help of Ollama, you can run it locally, with no cloud services or API keys needed.

What is Visual RAG?

Visual RAG (Retrieval-Augmented Generation) extends traditional RAG by incorporating image understanding alongside text processing. Unlike standard RAG systems that only work with text documents, visual RAG can analyze charts, diagrams, scanned documents, and other visual content to provide more comprehensive answers.

Let’s walk through how to build your Visual RAG pipeline using LLaMA 3.2 Vision and Ollama.

Building a visual RAG pipeline with Ollama

Let’s create a visual RAG pipeline with LLaMA 3.2 Vision. We will provide the model with a PDF and have it fetch appropriate content (text + images) and output answers. Here’s a quick overview of the steps we’ll walk through:

  1. Install Ollama and pull the llama3.2-vision model

  2. Convert PDF to images and extract text using OCR

  3. Load embedding models for text and images

  4. Initialize ChromaDB for storing and querying embeddings

  5. Generate image embeddings using CLIP

  6. Index PDF pages and store corresponding image embeddings

  7. Retrieve top matches using text, then re-rank with image similarity

  8. Run LLaMA Vision on the top-ranked image and query

  9. Test the full pipeline with real document-based questions

Let’s go through each step in detail.

Step 1: Install Ollama and pull the llama3.2-vision model

Ollama lets us run large language models, including multimodal ones, locally with minimal setup.

Install Ollama

To install Ollama, visit the official website of Ollama and click on “Download”. Then select the installer for your operating system.

[Image: The Ollama download page showing installers for each supported platform]

Pull the llama3.2-vision model

To use the version that accepts images and text, you need to pull the llama3.2-vision model. Run the following command on the terminal:

ollama pull llama3.2-vision

This may take a few minutes, depending on your internet speed. The model is large (several GBs), and Ollama will cache it locally.

Run the installed model

Once the model is pulled, try running it interactively:

ollama run llama3.2-vision

You’ll enter an interactive shell where you can type a prompt and watch the model respond directly in the terminal.

[Image: Terminal session showing the ollama run llama3.2-vision command and its interactive prompt]

When you run ollama run llama3.2-vision, here’s what’s happening:

  • Ollama loads the model into memory (CPU or GPU, depending on your setup).
  • It waits for input: text, image, or both.
  • The model performs multimodal reasoning if your prompt includes <image>...</image> tags with valid image data.
  • Otherwise, it responds as a regular text-only LLM.

This sets the stage for our Visual RAG pipeline.

Step 2: Convert the PDF into images and extract text with OCR

LLaMA 3.2 Vision can analyze images, but it cannot read PDFs directly. That’s why we need to convert the PDF into images before we can ask questions about its contents. We’ll also extract the text from each page via OCR to make the content searchable. Here’s what we will do:

  • Convert each page of the PDF into an image file (.png)

  • Run OCR (Optical Character Recognition) to pull text from those images

  • Save both the image path and the extracted text for later indexing

You can use any PDF document you’d like. Ensure it has a mix of visuals and text, like diagrams or scanned pages.

We’ll use pdf2image to convert the PDF into a series of PNGs, and pytesseract to extract the text from each image. Both outputs (the image path and the OCR text) will be stored together for later use. In your IDE, create a file named visual_rag.py and start building the code:

from pdf2image import convert_from_path
from PIL import Image
import pytesseract
import os
# Convert PDF pages to images and extract OCR text
pdf_path = 'sample_file.pdf'
image_folder = 'pdf_images'
os.makedirs(image_folder, exist_ok=True)
pages = convert_from_path(pdf_path)
image_paths = []
page_texts = []
for i, page in enumerate(pages):
    img_path = os.path.join(image_folder, f'page_{i+1}.png')
    page.save(img_path, 'PNG')
    image_paths.append(img_path)
    text = pytesseract.image_to_string(Image.open(img_path))
    page_texts.append({'image_path': img_path, 'text': text})

In this code:

  • convert_from_path() loads all pages from the given PDF and returns them as a list of image objects.
  • Each page is saved as a .png file in the specified folder (pdf_images/).
  • pytesseract.image_to_string() performs OCR, extracting any readable text on each image.
  • Both the image path and the text are saved together in the page_texts list.

Note: You’ll need poppler installed on your system for pdf2image to work.

  • On Windows: Download from the official poppler site
  • On Mac: brew install poppler
  • On Ubuntu: sudo apt install poppler-utils

Note: pytesseract is only a Python wrapper; it needs the Tesseract OCR engine installed on your system. If Tesseract is not installed, download it from the Tesseract GitHub wiki page and install it. On Windows, also add it to your system path: go to Environment Variables > System variables > Path > Edit, and add the install directory, typically C:\Program Files\Tesseract-OCR.
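As an optional sanity check (a minimal sketch, assuming the code above ran without errors), print the number of converted pages and the start of the OCR text for the first page to confirm extraction worked:

# Optional: confirm the conversion and OCR produced usable output
print(f"Converted {len(image_paths)} pages")
print(page_texts[0]['text'][:300])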

Having obtained the text and image data from each page, let us make the text searchable through embeddings.

Step 3: Load the embedding models

Before indexing and retrieving, we have to embed the text and image content into a vector space. Essentially, we transform them into numerical representations that we can search and compare.

We’ll load two different models for this:

  • A SentenceTransformer for text embeddings

  • OpenAI’s CLIP model for image embeddings

from sentence_transformers import SentenceTransformer
from transformers import CLIPProcessor, CLIPModel
# Load models
text_model = SentenceTransformer('all-MiniLM-L6-v2')
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

Here:

  • all-MiniLM-L6-v2 is a lightweight model from the SentenceTransformer library. It converts text into dense vector embeddings that capture semantic meaning.

  • CLIPModel and CLIPProcessor belong to OpenAI’s CLIP architecture. CLIP stands for Contrastive Language-Image Pretraining, which is excellent for comparing text to images in a shared vector space.

With these two models, our system can retrieve results based on either text or image similarity.
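As a quick illustration (a minimal sketch using the models loaded above), you can inspect the vectors each model produces. The two models output vectors of different sizes and live in different embedding spaces, which is why we keep text and image embeddings separate in the following steps:

import torch

# Text embedding from the SentenceTransformer (384 dimensions for all-MiniLM-L6-v2)
sample_text_emb = text_model.encode("A diagram of a carbon atom")
print(sample_text_emb.shape)  # (384,)

# Text embedding from CLIP's text encoder (512 dimensions for ViT-B/32)
sample_inputs = clip_processor(text=["A diagram of a carbon atom"], return_tensors="pt", padding=True)
with torch.no_grad():
    sample_clip_emb = clip_model.get_text_features(**sample_inputs)
print(sample_clip_emb.shape)  # torch.Size([1, 512])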

Step 4: Initialize ChromaDB for storage and retrieval

To search our document later, we need a database that can store and retrieve vectors based on similarity. That’s where ChromaDB comes in. We’ll use it to store text and image embeddings side by side:

import chromadb
client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection(name="visual_rag_pages")

Here:

  • chromadb.HttpClient(...) connects Python code to a running ChromaDB server instance.

  • get_or_create_collection(...) creates (or loads, if it already exists) a named collection. In our case, we’re calling it “visual_rag_pages”. This is where all page vectors will be stored.

Make sure you’ve started the Chroma server before running this step. You can do that by running:

chroma run
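If you’d rather not run a separate server, Chroma also ships an embedded client that persists to disk. The snippet below is an alternative setup (not the one used in the rest of this tutorial) that stores the collection in a local folder instead of connecting to a server:

import chromadb

# Alternative: embedded client that writes to a local directory, no server required
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="visual_rag_pages")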

Step 5: Generate image embeddings with CLIP

To enable visual search, we need to transform each image into a numerical vector that represents its meaning. These vectors are referred to as image embeddings. We’ll generate them using CLIP, which maps images and text into a shared embedding space so the two can be compared directly.

Here’s the helper function we’ll use:

import torch
def get_image_embedding(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = clip_model.get_image_features(**inputs)
    return outputs[0].cpu().numpy()

In this code:

  • Image.open(image_path).convert("RGB"): Loads the image and ensures it’s in RGB mode (as required by CLIP).

  • clip_processor(...): Preprocesses the image (resizing, normalization, etc.) and prepares it for the model.

  • clip_model.get_image_features(...): Runs the image through CLIP and returns a high-dimensional vector representing the image.

  • torch.no_grad(): Disables gradient tracking since we’re not training the model and just using it for inference.

  • Finally, we convert the tensor into a NumPy array for compatibility with ChromaDB.

Each embedding acts like a fingerprint of the image’s visual content.
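As a quick check (assuming the pages converted in Step 2 are available), you can embed the first page and inspect the resulting vector:

# Embed the first PDF page and check the embedding size
emb = get_image_embedding(image_paths[0])
print(emb.shape)  # (512,) for openai/clip-vit-base-patch32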

Step 6: Index PDF pages with text and store image embeddings

Now that we’ve got everything ready: text extracted, text embedding model loaded, and image embeddings working with CLIP, let’s index the document.

Here’s the approach:

image_embeddings = {}
for idx, page in enumerate(page_texts):
    text_emb = text_model.encode(page['text'])
    image_emb = get_image_embedding(page['image_path'])
    page_id = f"page_{idx+1}"
    image_embeddings[page_id] = image_emb  # Store separately
    collection.add(
        documents=[page['text']],
        embeddings=[text_emb.tolist()],
        metadatas=[{
            'image_path': page['image_path'],
            'page_number': idx + 1
        }],
        ids=[page_id]
    )
print("✅ PDF indexed with hybrid embeddings (text + image)")

This code:

  • Loops through each page, generating text embedding and image embedding.

  • Stores image embeddings separately in a dictionary (image_embeddings) for later visual re-ranking.

  • Adds the text embedding and metadata (image path + page number) to the ChromaDB collection.

  • Each entry is uniquely identified by a page ID such as page_1, page_2, and so on.
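To confirm the indexing worked (a small optional check), count the stored entries and peek at the metadata saved for the first page:

# Verify that every page made it into the collection
print(collection.count())  # should equal the number of PDF pages

# Inspect the metadata stored for the first page
print(collection.get(ids=["page_1"], include=["metadatas"]))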

Step 7: Retrieve with text, re-rank with image similarity

Now that the PDF is indexed with text and image embeddings, it’s time to build an intelligent retrieval function.

Here’s what we want:

  • Use the text query to fetch the most relevant pages from ChromaDB.

  • Then re-rank those results based on visual similarity between the query and the images, using CLIP.

import numpy as np
def hybrid_retrieve(query, top_k=3):
    # Encode the query for text search (SentenceTransformer) and for visual comparison (CLIP text encoder)
    text_query_emb = text_model.encode(query)
    clip_inputs = clip_processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        image_query_emb = clip_model.get_text_features(**clip_inputs).cpu().numpy()
    # Fetch the top_k pages from ChromaDB by text similarity
    results = collection.query(query_embeddings=[text_query_emb.tolist()], n_results=top_k)
    def cosine_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Re-rank the candidates by visual similarity between the query and each page image
    best_score = -1
    best_img_path = None
    for i in range(len(results['ids'][0])):
        page_id = results['ids'][0][i]
        meta = results['metadatas'][0][i]
        img_emb = image_embeddings[page_id]
        score = cosine_sim(image_query_emb[0], img_emb)
        if score > best_score:
            best_score = score
            best_img_path = meta['image_path']
    return best_img_path

Here’s how it works:

  • First, we encode the query twice:
    • One text embedding for ChromaDB search.
    • One image-style embedding using CLIP’s text encoder (this gives us a way to compare text and image in the same space).
  • We search ChromaDB for the top k pages based on text similarity.
  • Then, for each of those pages:
    • We pull the saved image embedding.
    • Compare it with the query’s image-style embedding using cosine similarity.
  • Finally, we return the image path with the highest visual alignment to the query.

This hybrid ranking ensures the result is not just textually relevant, but also visually accurate.

Step 8: Run vision model on top-ranked image

We’ve done the retrieval. We’ve ranked the results. Now it’s time to put the model to work.

This step sends both the query and the most relevant image into LLaMA 3.2 Vision using the command line. The model interprets the image and generates a response.

from base64 import b64encode
import subprocess
def run_llama_vision(image_path, query):
    with open(image_path, "rb") as img_file:
        encoded_image = b64encode(img_file.read()).decode('utf-8')
    prompt = f"<image>{encoded_image}</image>\n{query}"
    result = subprocess.run(
        ["ollama", "run", "llama3.2-vision"],
        input=prompt.encode(),
        capture_output=True
    )
    print("🧠 Model Response:\n", result.stdout.decode())

Here’s what’s happening:

  • The image is opened and encoded into base64 so it can be embedded in the prompt using <image>...</image> tags.

  • We combine the encoded image and the user’s query into a single prompt.

  • This prompt is sent to the LLaMA 3.2 Vision model using the ollama run command via Python’s subprocess.

  • The model’s response based on the visual and textual input is printed to the console.
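If piping a large base64 string through the CLI feels brittle, Ollama also exposes a local REST API that accepts images directly. The sketch below is an alternative to run_llama_vision (assuming Ollama is running on its default port 11434 and the requests library is installed):

import requests
from base64 import b64encode

def run_llama_vision_api(image_path, query):
    # Send the query plus the base64-encoded image to Ollama's /api/generate endpoint
    with open(image_path, "rb") as img_file:
        encoded_image = b64encode(img_file.read()).decode("utf-8")
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.2-vision",
            "prompt": query,
            "images": [encoded_image],
            "stream": False,
        },
    )
    print("🧠 Model Response:\n", response.json()["response"])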

You’re no longer just searching documents; you’re seeing them, asking about them, and getting grounded answers from them.

Step 9: Test the full pipeline

We’ll pass in a natural language query, retrieve the most relevant image using both text and visual cues, and get a grounded answer from LLaMA 3.2 Vision.

query = "Describe the diagram of a carbon atom."
top_image = hybrid_retrieve(query)
print("🔍 Best matching image:", top_image)
run_llama_vision(top_image, query)

You can change the query based on the data in your PDF.

A sample output generated by this is as follows:

✅ PDF indexed with hybrid embeddings (text + image)
🔍 Best matching image: pdf2_images\page_8.png
🧠 Model Response: The diagram appears to be a simplified representation of a carbon atom, which is a fundamental element in chemistry. Here's a breakdown of the diagram:
- The central circle represents the nucleus of the carbon atom, which contains 6 protons and 6 electrons (not shown).
- The 6 electrons are arranged in four regions or orbitals, which are:
- 1s orbital (not shown): This is the innermost orbital, which holds 2 electrons.
- 2s orbital (not shown): This orbital holds 2 electrons.
- 2p orbitals (two small circles on either side of the central circle): These orbitals hold 2 electrons each.
- 2p orbitals (two small circles on the top and bottom of the central circle): These orbitals hold 2 electrons each.
- The 4 valence electrons (2s and 2p electrons) are arranged in a tetrahedral shape, which is a common arrangement for carbon atoms in molecules.
The diagram is likely a simplified representation of a carbon atom, and it may not show all the details of the atomic structure. However, it does illustrate the basic arrangement of electrons in a carbon atom, which is a fundamental concept in chemistry.

Let’s change the query to:

query = "Describe the following term: Electromotive force (EMF)"

Here’s a sample output generated:

✅ PDF indexed with hybrid embeddings (text + image)
🔍 Best matching image: pdf_images\page_7.png
🧠 Model Response: Electromotive force (EMF) is the energy per unit charge that drives electric current through a conductor, such as a wire. It is the potential difference between two points in an electric circuit that causes electrons to flow from one point to another. EMF is typically measured in volts (V) and is a fundamental concept in electric circuits and electromagnetism. It is a key factor in determining the behavior of electric currents and is used to calculate the work done by an electric field on a charge.

Now you’ve got a complete visual RAG pipeline that retrieves the most relevant image and uses LLaMA 3.2 Vision to generate a response based on both the question and visual content!

Let’s zoom out and understand what powers a Visual RAG system.

Components of a visual RAG pipeline

Visual RAG combines three powerful technologies:

  • Multimodal encoders to understand images
  • Vector databases to retrieve relevant content
  • Large language models like LLaMA 3.2 Vision to generate answers.

Here’s how each piece fits in:

Image encoder

Turns visuals (like diagrams and scanned pages) into vector embeddings that capture meaning, which is essential for comparing and retrieving relevant images.

  • Use CLIP, BLIP, or LLaMA Vision
  • Enables visual similarity search
  • Powers image-level understanding

Vector store

Stores and searches embeddings efficiently. Links each image/text embedding with its metadata for fast, relevant retrieval.

  • Stores text and image embeddings, along with image paths and other metadata
  • Supports fast similarity search using cosine distance or other metrics
  • Examples include:
    • FAISS (lightweight, local)
    • ChromaDB, Qdrant, or Weaviate (scalable or cloud-hosted)

Generator (Llama 3.2 Vision)

Takes the retrieved image and your query and generates a response by reasoning across both text and visuals.

  • Accepts combined image + text input in a single prompt
  • Performs multimodal reasoning to answer questions grounded in the image
  • Runs locally via Ollama

Together, these parts form the backbone of the visual RAG pipeline.

Conclusion

Visual RAG pipelines bring AI closer to how we process information by combining what we see with what we read. In this guide, we built a full-stack system that takes a PDF, breaks it into images and text, indexes them with hybrid embeddings, and re-ranks results to answer visual questions using LLaMA 3.2 Vision.

To level up your skills, check out the Creating AI Applications using Retrieval-Augmented Generation (RAG) course, where you’ll build apps with Streamlit and Chroma while learning how LLMs retrieve and use external knowledge.

Frequently asked questions

1. What is LLaMA 3.2 Vision Instruct?

It’s a multimodal version of Meta’s LLaMA 3.2 model that can understand images and text in the same prompt. The “Instruct” variant is optimized for following instructions and generating helpful responses based on visual input.

2. How do I use the LLaMA 3.2 Vision model?

You can run it locally using Ollama. After installing Ollama, run ollama pull llama3.2-vision to download the model and ollama run llama3.2-vision to interact with it. You can send images using base64-encoded strings wrapped in <image>...</image> tags.

3. How does LLaMA 3.2 compare to GPT-4?

LLaMA 3.2 Vision is strong in open-source flexibility and multimodal capabilities, especially when run locally with no cloud dependency. GPT-4 (particularly GPT-4V) often leads in accuracy and general reasoning, but it requires access to OpenAI’s APIs and is not open-source.

4. What is the image size for LLaMA 3.2 Vision?

LLaMA 3.2 Vision handles images up to 224x224 pixels effectively. While larger images can be passed, they’ll be automatically resized and center cropped.

5. Is LLaMA 3 better than ChatGPT?

It depends on the use case. LLaMA 3 offers high performance, open access, and local deployment, making it ideal for research and custom pipelines. ChatGPT (based on GPT-4) shines in general usability, tool integration, and consistent output across a wider range of queries.

Codecademy Team

The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.
