Building Visual RAG Pipelines with Llama 3.2 Vision & Ollama
LLaMA 3.2 Vision is Meta's vision-language model that processes text and images in a single input, which makes it a multimodal large language model. It moves past the limits of text-only RAG by enabling retrieval and reasoning over images as well as text. With the help of Ollama, you can run it locally, with no cloud services or API keys needed.
What is Visual RAG?
Visual RAG (Retrieval-Augmented Generation) extends traditional RAG by incorporating image understanding alongside text processing. Unlike standard RAG systems that only work with text documents, visual RAG can analyze charts, diagrams, scanned documents, and other visual content to provide more comprehensive answers.
Let’s walk through how to build your Visual RAG pipeline using LLaMA 3.2 Vision and Ollama.
Building a visual RAG pipeline with Ollama
Let's create a visual RAG pipeline with LLaMA 3.2 Vision. We will index a PDF, retrieve the most relevant content (text + images) for a query, and have the model generate an answer. Here's a quick overview of the steps we'll walk through:
- Install Ollama and pull the llama3.2-vision model
- Convert the PDF to images and extract text using OCR
- Load embedding models for text and images
- Initialize ChromaDB for storing and querying embeddings
- Generate image embeddings using CLIP
- Index PDF pages and store corresponding image embeddings
- Retrieve top matches using text, then re-rank with image similarity
- Run LLaMA Vision on the top-ranked image and query
- Test the full pipeline with real document-based questions
Let’s go through each step in detail.
Step 1: Install Ollama and pull the llama3.2-vision model
Ollama lets us run large language models, including multimodal ones, locally with minimal setup.
Install Ollama
To install Ollama, visit the official website of Ollama and click on “Download”. Then select the installer for your operating system.

Pull the llama3.2-vision model
To use the version that accepts images and text, you need to pull the llama3.2-vision model. Run the following command in the terminal:
ollama pull llama3.2-vision
This may take a few minutes, depending on your internet speed. The model is large (several GBs), and Ollama will cache it locally.
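Once the download completes, you can confirm the model is cached locally by listing the models Ollama has pulled:

ollama list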
Run the installed model
Once the model is pulled, try running it interactively:
ollama run llama3.2-vision
You'll enter an interactive shell where you can type prompts directly. Try a simple text question first to confirm the model responds.
When you run ollama run llama3.2-vision, here’s what’s happening:
- Ollama loads the model into memory (CPU or GPU, depending on your setup).
- It waits for input: text, image, or both.
- The model performs multimodal reasoning if your prompt includes <image>...</image> tags with valid image data.
- Otherwise, it responds as a regular text-only LLM.
This sets the stage for our Visual RAG pipeline.
Step 2: Convert the PDF into images and extract text with OCR
LLaMA 3.2 Vision can analyze images, but not PDFs directly, so we first need to turn each page of the PDF into an image before we can ask questions about it. From there, we'll also extract the text via OCR to make the content searchable. Here's what we will do:
- Convert each page of the PDF into an image file (.png)
- Run OCR (Optical Character Recognition) to pull text from those images
- Save both the image path and the extracted text for later indexing
You can use any PDF document you’d like. Ensure it has a mix of visuals and text, like diagrams or scanned pages.
We'll use pdf2image to convert the PDF into a series of PNGs and pytesseract to extract the text from each image. Both outputs, the image path and the OCR text, will be stored together for later use. In your IDE, create a file called visual_rag.py and start building the code:
from pdf2image import convert_from_path
from PIL import Image
import pytesseract
import os

# Convert PDF pages to images and extract OCR text
pdf_path = 'sample_file.pdf'
image_folder = 'pdf_images'
os.makedirs(image_folder, exist_ok=True)

pages = convert_from_path(pdf_path)
image_paths = []
page_texts = []

for i, page in enumerate(pages):
    img_path = os.path.join(image_folder, f'page_{i+1}.png')
    page.save(img_path, 'PNG')
    image_paths.append(img_path)
    text = pytesseract.image_to_string(Image.open(img_path))
    page_texts.append({'image_path': img_path, 'text': text})
In this code:
- convert_from_path() loads all pages from the given PDF and returns them as a list of image objects.
- Each page is saved as a .png file in the specified folder (pdf_images/).
- pytesseract.image_to_string() performs OCR, extracting any readable text from each image.
- Both the image path and the text are saved together in the page_texts list.
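As a quick optional sanity check, using the page_texts list built above, you can print how many pages were processed and preview the OCR text from the first page:

print(f"Processed {len(page_texts)} pages")
print(page_texts[0]['text'][:300])  # first 300 characters of OCR text from page 1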
Note: You'll need poppler installed on your system for pdf2image to work.
- On Windows: download the binaries from the official poppler site
- On Mac: brew install poppler
- On Ubuntu: sudo apt install poppler-utils
Note: pytesseract is a Python wrapper around the Tesseract OCR engine. If Tesseract is not installed on your device, download the installer from its GitHub wiki page and install it. Once done, add it to your system path: go to Environment Variables > System variables > Path > Edit, and add the install folder, typically C:\Program Files\Tesseract-OCR.
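Alternatively, if you prefer not to edit the PATH, pytesseract can be pointed directly at the Tesseract executable from Python (the path below is the typical Windows install location and may differ on your machine):

import pytesseract

# Tell pytesseract where the Tesseract binary lives (Windows example path)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'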
Having obtained the text and image data from each page, let us make the text searchable through embeddings.
Step 3: Load the embedding models
Before indexing and retrieving, we have to embed the text and image content into a vector space. Essentially, we transform them into numerical representations that we can search and compare.
We’ll load two different models for this:
- A SentenceTransformer model for text embeddings
- OpenAI's CLIP model for image embeddings
from sentence_transformers import SentenceTransformer
from transformers import CLIPProcessor, CLIPModel

# Load models
text_model = SentenceTransformer('all-MiniLM-L6-v2')
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
Here:
- all-MiniLM-L6-v2 is a lightweight model from the sentence-transformers library. It converts text into dense vector embeddings that capture semantic meaning.
- CLIPModel and CLIPProcessor belong to OpenAI's CLIP architecture. CLIP stands for Contrastive Language-Image Pretraining, and it is excellent at comparing text and images in a shared vector space.
With these two models, our system can retrieve results based on either text or image similarity.
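As a quick check that both models loaded correctly, you can embed a sample sentence and inspect the vector sizes: all-MiniLM-L6-v2 produces 384-dimensional text vectors, while this CLIP variant projects text and images into a 512-dimensional space. The sample sentence below is just an illustration:

import torch

sample = "A diagram of a carbon atom"
text_vec = text_model.encode(sample)
clip_inputs = clip_processor(text=[sample], return_tensors="pt", padding=True)
with torch.no_grad():
    clip_vec = clip_model.get_text_features(**clip_inputs)[0]

print(text_vec.shape)   # (384,)
print(clip_vec.shape)   # torch.Size([512])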
Step 4: Initialize ChromaDB for storage and retrieval
To search our document later, we need a database that can store and retrieve vectors based on similarity. That’s where ChromaDB comes in. We’ll use it to store text and image embeddings side by side:
import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection(name="visual_rag_pages")
Here:
- chromadb.HttpClient(...) connects our Python code to a running ChromaDB server instance.
- get_or_create_collection(...) creates (or loads, if it already exists) a named collection. In our case, we're calling it "visual_rag_pages". This is where all page vectors will be stored.
Make sure you’ve started the Chroma server before running this step. You can do that by running:
chroma run
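If you'd rather not run a separate server process, recent versions of ChromaDB also ship an embedded client that persists to a local folder. This is an alternative setup to the HTTP client used in this guide, sketched here with an assumed folder name:

import chromadb

# Embedded alternative: data is stored on disk in ./chroma_store, no server needed
client = chromadb.PersistentClient(path="chroma_store")
collection = client.get_or_create_collection(name="visual_rag_pages")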
Step 5: Generate image embeddings with CLIP
To enable visual search, we need to transform each image into a numerical vector that represents its content. These vectors are called image embeddings. We'll generate them with CLIP, since it maps images and text into a shared vector space.
Here’s the helper function we’ll use:
import torch

def get_image_embedding(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = clip_model.get_image_features(**inputs)
    return outputs[0].cpu().numpy()
In this code:
- Image.open(image_path).convert("RGB") loads the image and ensures it's in RGB mode (as required by CLIP).
- clip_processor(...) preprocesses the image (resizing, normalization, etc.) and prepares it for the model.
- clip_model.get_image_features(...) runs the image through CLIP and returns a high-dimensional vector representing the image.
- torch.no_grad() disables gradient tracking, since we're not training the model and just using it for inference.
- Finally, we convert the tensor into a NumPy array for compatibility with ChromaDB.
Each embedding acts like a fingerprint of the image’s visual content.
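For example, embedding the first page image from Step 2 and checking the result's shape should give a 512-dimensional NumPy vector:

emb = get_image_embedding(image_paths[0])
print(emb.shape)  # (512,) for openai/clip-vit-base-patch32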
Step 6: Index PDF pages with text and store image embeddings
Now that we’ve got everything ready: text extracted, text embedding model loaded, and image embeddings working with CLIP, let’s index the document.
Here’s the approach:
image_embeddings = {}

for idx, page in enumerate(page_texts):
    text_emb = text_model.encode(page['text'])
    image_emb = get_image_embedding(page['image_path'])
    page_id = f"page_{idx+1}"
    image_embeddings[page_id] = image_emb  # Store separately for visual re-ranking

    collection.add(
        documents=[page['text']],
        embeddings=[text_emb.tolist()],
        metadatas=[{'image_path': page['image_path'], 'page_number': idx + 1}],
        ids=[page_id]
    )

print("✅ PDF indexed with hybrid embeddings (text + image)")
This code:
- Loops through each page, generating a text embedding and an image embedding.
- Stores image embeddings separately in a dictionary (image_embeddings) for later visual re-ranking.
- Adds the text embedding and metadata (image path + page number) to the ChromaDB collection.
- Identifies each entry uniquely by a page ID such as page_1, page_2, and so on.
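To verify the indexing worked, you can confirm that the collection holds one entry per page and that the image embedding dictionary matches:

print(collection.count())     # should equal the number of PDF pages
print(len(image_embeddings))  # should be the same number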
Step 7: Retrieve with text, re-rank with image similarity
Now that the PDF is indexed with text and image embeddings, it’s time to build an intelligent retrieval function.
Here’s what we want:
Use the text query to fetch the most relevant pages from ChromaDB.
Then re-rank those results based on visual similarity between the query and the images, using CLIP.
import numpy as np

def hybrid_retrieve(query, top_k=3):
    # Encode the query for text search (SentenceTransformer) and for visual comparison (CLIP text encoder)
    text_query_emb = text_model.encode(query)
    clip_inputs = clip_processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        image_query_emb = clip_model.get_text_features(**clip_inputs).cpu().numpy()

    # Fetch the top_k pages by text similarity from ChromaDB
    results = collection.query(query_embeddings=[text_query_emb.tolist()], n_results=top_k)

    def cosine_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Re-rank the returned pages by visual similarity between the query and each page image
    best_score = -1
    best_img_path = None
    for i in range(len(results['ids'][0])):
        page_id = results['ids'][0][i]
        meta = results['metadatas'][0][i]
        img_emb = image_embeddings[page_id]
        score = cosine_sim(image_query_emb[0], img_emb)
        if score > best_score:
            best_score = score
            best_img_path = meta['image_path']
    return best_img_path
Here’s how it works:
- First, we encode the query twice:
  - One text embedding for the ChromaDB search.
  - One image-style embedding using CLIP's text encoder (this gives us a way to compare text and image in the same space).
- We search ChromaDB for the top k pages based on text similarity.
- Then, for each of those pages:
  - We pull the saved image embedding.
  - We compare it with the query's image-style embedding using cosine similarity.
- Finally, we return the image path with the highest visual alignment to the query.
This hybrid ranking ensures the result is not just textually relevant, but also visually accurate.
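If you want to weigh both signals rather than let image similarity alone pick the winner, one possible variation is to blend the text distance returned by ChromaDB with the CLIP similarity into a single score. The helper below is a hypothetical sketch, not part of the pipeline above; the 50/50 weighting and the conversion from distance to similarity are assumptions you would tune:

def hybrid_score(text_distance, image_similarity, alpha=0.5):
    # ChromaDB returns a distance (lower is better), so convert it to a similarity-style score
    text_score = 1.0 / (1.0 + text_distance)
    # Weighted blend of textual relevance and visual alignment
    return alpha * text_score + (1 - alpha) * image_similarity

Inside the loop in hybrid_retrieve, you would read the matching distance from results['distances'][0][i] and rank pages by hybrid_score instead of the raw image similarity.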
Step 8: Run vision model on top-ranked image
We’ve done the retrieval. We’ve ranked the results. Now it’s time to put the model to work.
This step sends both the query and the most relevant image into LLaMA 3.2 Vision using the command line. The model interprets the image and generates a response.
from base64 import b64encode
import subprocess

def run_llama_vision(image_path, query):
    # Encode the image as base64 so it can be embedded directly in the prompt
    with open(image_path, "rb") as img_file:
        encoded_image = b64encode(img_file.read()).decode('utf-8')
    prompt = f"<image>{encoded_image}</image>\n{query}"
    # Send the combined prompt to the local model via the Ollama CLI
    result = subprocess.run(
        ["ollama", "run", "llama3.2-vision"],
        input=prompt.encode(),
        capture_output=True
    )
    print("🧠 Model Response:\n", result.stdout.decode())
Here’s what’s happening:
- The image is opened and encoded into base64 so it can be embedded in the prompt using <image>...</image> tags.
- We combine the encoded image and the user's query into a single prompt.
- This prompt is sent to the LLaMA 3.2 Vision model using the ollama run command via Python's subprocess module.
- The model's response, based on both the visual and textual input, is printed to the console.
You're no longer just searching documents; you're seeing them, asking about them, and getting grounded answers from them.
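As an alternative to shelling out to the CLI, Ollama also serves a local REST API on port 11434. The sketch below, using the requests library, sends the same query and base64-encoded image to the /api/generate endpoint; it reaches the same model but is not the approach used in this guide:

import requests
from base64 import b64encode

def run_llama_vision_api(image_path, query):
    with open(image_path, "rb") as img_file:
        encoded_image = b64encode(img_file.read()).decode('utf-8')
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.2-vision",
            "prompt": query,
            "images": [encoded_image],  # base64 strings go in a separate field here
            "stream": False,
        },
    )
    print("🧠 Model Response:\n", response.json()["response"])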
Step 9: Test the full pipeline
We’ll pass in a natural language query, retrieve the most relevant image using both text and visual cues, and get a grounded answer from LLaMA 3.2 Vision.
query = "Describe the diagram of a carbon atom."
top_image = hybrid_retrieve(query)
print("🔍 Best matching image:", top_image)
run_llama_vision(top_image, query)
You can change the query based on the data in your PDF.
A sample output generated by this is as follows:
✅ PDF indexed with hybrid embeddings (text + image)
🔍 Best matching image: pdf2_images\page_8.png
🧠 Model Response:
The diagram appears to be a simplified representation of a carbon atom, which is a fundamental element in chemistry. Here's a breakdown of the diagram:
- The central circle represents the nucleus of the carbon atom, which contains 6 protons and 6 electrons (not shown).
- The 6 electrons are arranged in four regions or orbitals, which are:
- 1s orbital (not shown): This is the innermost orbital, which holds 2 electrons.
- 2s orbital (not shown): This orbital holds 2 electrons.
- 2p orbitals (two small circles on either side of the central circle): These orbitals hold 2 electrons each.
- 2p orbitals (two small circles on the top and bottom of the central circle): These orbitals hold 2 electrons each.
- The 4 valence electrons (2s and 2p electrons) are arranged in a tetrahedral shape, which is a common arrangement for carbon atoms in molecules.
The diagram is likely a simplified representation of a carbon atom, and it may not show all the details of the atomic structure. However, it does illustrate the basic arrangement of electrons in a carbon atom, which is a fundamental concept in chemistry.
Let’s change the query to:
query = "Describe the following term: Electromotive force (EMF)"
Here’s a sample output generated:
✅ PDF indexed with hybrid embeddings (text + image)
🔍 Best matching image: pdf_images\page_7.png
🧠 Model Response:
Electromotive force (EMF) is the energy per unit charge that drives electric current through a conductor, such as a wire. It is the potential difference between two points in an electric circuit that causes electrons to flow from one point to another. EMF is typically measured in volts (V) and is a fundamental concept in electric circuits and electromagnetism. It is a key factor in determining the behavior of electric currents and is used to calculate the work done by an electric field on a charge.
Now you’ve got a complete visual RAG pipeline that retrieves the most relevant image and uses LLaMA 3.2 Vision to generate a response based on both the question and visual content!
Let’s zoom out and understand what powers a Visual RAG system.
Components of a visual RAG pipeline
Visual RAG combines three powerful technologies:
- Multimodal encoders to understand images
- Vector databases to retrieve relevant content
- Large language models like LLaMA 3.2 Vision to generate answers.
Here’s how each piece fits in:
Image encoder
Turns visuals (like diagrams and scanned pages) into vector embeddings that capture meaning, which is essential for comparing and retrieving relevant images.
- Use CLIP, BLIP, or LLaMA Vision
- Enables visual similarity search
- Powers image-level understanding
Vector store
Stores and searches embeddings efficiently. Links each image/text embedding with its metadata for fast, relevant retrieval.
- Stores text and image embeddings, along with image paths and other metadata
- Supports fast similarity search using cosine distance or other metrics
- Examples include:
- FAISS (lightweight, local)
- ChromaDB, Qdrant, or Weaviate (scalable or cloud-hosted)
Generator (Llama 3.2 Vision)
Takes the retrieved image and your query and generates a response by reasoning across both text and visuals.
- Accepts combined image + text input in a single prompt
- Performs multimodal reasoning to answer questions grounded in the image
- Runs locally via Ollama
Together, these parts form the backbone of the visual RAG pipeline.
Conclusion
Visual RAG pipelines bring AI closer to how we process information by combining what we see with what we read. In this guide, we built a full-stack system that takes a PDF, breaks it into images and text, indexes them with hybrid embeddings, and re-ranks results to answer visual questions using LLaMA 3.2 Vision.
To level up your skills, check out the Creating AI Applications using Retrieval-Augmented Generation (RAG) course, where you’ll build apps with Streamlit and Chroma while learning how LLMs retrieve and use external knowledge.
Frequently asked questions
1. What is LLaMA 3.2 Vision Instruct?
It’s a multimodal version of Meta’s LLaMA 3.2 model that can understand images and text in the same prompt. The “Instruct” variant is optimized for following instructions and generating helpful responses based on visual input.
2. How do I use the LLaMA 3.2 Vision model?
You can run it locally using Ollama. After installing Ollama, run ollama pull llama3.2-vision to download the model and ollama run llama3.2-vision to interact with it. You can send images using base64-encoded strings wrapped in <image>...</image> tags.
3. How does LLaMA 3.2 compare to GPT-4?
LLaMA 3.2 Vision is strong in open-source flexibility and multimodal capabilities, especially when run locally with no cloud dependency. GPT-4 (particularly GPT-4V) often leads in accuracy and general reasoning, but it requires access to OpenAI's APIs and is not open-source.
4. What is the image size for LLaMA 3.2 Vision?
LLaMA 3.2 Vision handles images up to 224x224 pixels effectively. While larger images can be passed, they’ll be automatically resized and center cropped.
5. Is LLaMA 3 better than ChatGPT?
It depends on the use case. LLaMA 3 offers high performance, open access, and local deployment, making it ideal for research and custom pipelines. ChatGPT (based on GPT-4) shines in general usability, tool integration, and consistent output across a wider range of queries.