How to Build RAG Pipelines in AI Applications?

Retrieval-augmented generation (RAG) is an AI technique that generates accurate responses to user queries by retrieving relevant documents from an external database and passing them as context to a large language model (LLM).

This article will help you understand the basics of RAG pipelines, their components, and how they work by building a RAG pipeline using an external dataset, a vector database, and an LLM.


What is a RAG pipeline?

A RAG pipeline is the structured workflow that combines external data retrieval with an LLM’s text generation capabilities to generate context-aware responses for a given query. A RAG pipeline consists of three stages: data ingestion, retrieval, and generation. A typical RAG pipeline looks as follows:

[Diagram: a typical RAG pipeline — data ingestion, retrieval, and generation stages]

A RAG pipeline has multiple components that perform specific tasks. Let’s discuss the different components of a RAG pipeline.

Components of a RAG pipeline

A RAG pipeline consists of three stages:

  • Data ingestion
  • Retrieval
  • Generation

Each stage of the RAG pipeline has different components, such as a vector database, an embedding model, an LLM, a prompt template, etc. Let’s explore the different components in the various stages of a RAG pipeline.

Data ingestion stage

In the data ingestion stage, we have a proprietary dataset, an embedding model, and a vector database.

  • Dataset: The dataset is the task-specific or domain-specific proprietary dataset that contains information to be used by the RAG pipeline while generating query responses. The documents in the dataset are split into small chunks, like paragraphs or sentences, and sent to the embedding model.
  • Embedding model: The embedding model generates vector embeddings for the text chunks in the dataset. Some popular embedding models include Google’s gemini-embedding-001 and OpenAI’s text-embedding-3-small.
  • Vector embeddings: Vector embeddings generated by the embedding model are a numerical representation of text chunks. Each vector embedding represents the semantic meaning of the text from which it was generated.
  • Vector database: The vector database stores and queries data using vector embeddings. We store the input data with vector embeddings and other metadata in the vector database.

The vector embeddings stored in the vector database are fetched by a retriever using the vector embedding of the user query.
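The idea that embeddings encode semantic similarity can be illustrated with a toy example. The following sketch uses hypothetical three-dimensional vectors (real embedding models output hundreds or thousands of dimensions) and a plain-Python cosine similarity:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a · b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, invented for illustration
cat = [0.9, 0.1, 0.2]        # "a cat sat on the mat"
kitten = [0.85, 0.15, 0.25]  # "a kitten rested on the rug"
invoice = [0.1, 0.9, 0.7]    # "the invoice is due on Friday"

# Semantically close texts score higher than unrelated ones
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```

This is the comparison a vector database performs at scale: it stores one such vector per chunk and ranks them by similarity to the query vector.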

Retrieval stage

When a user asks a query, the retrieval stage in the RAG pipeline fetches relevant documents for the query from the vector database. It has the following components:

  • Query: The query is the input the user gives to the RAG pipeline to get an answer.
  • Embedding model: The embedding model used in the retrieval stage is the same as that used in the ingestion stage. It converts the user query into a vector embedding.
  • Document retriever: A document retriever fetches relevant documents for the query from the vector database.

Once we have the relevant documents for the query, the generation stage uses the query and the documents to generate output.

Generation stage

In the generation stage of the RAG pipeline, the user query and relevant documents are combined using a prompt template. The combined prompt is then passed to the LLM to generate the output.

  • Prompt template: The prompt template combines the user query and the documents fetched by the retriever to create the final prompt for response generation.
  • LLM: The LLM can be self-hosted or API-accessible, like Gemini or OpenAI models. In the final stage of the RAG pipeline, the LLM generates output for the user query using the prompt.

With this introduction to the different components of a RAG pipeline, let’s discuss how it works.

How does a RAG pipeline work?

A RAG pipeline has several steps, including data ingestion, document retrieval, query and document augmentation, and response generation.

  • Data ingestion: We first preprocess our dataset and split it into small chunks, such as paragraphs or sentences, based on their size. Then, we generate embeddings for each chunk in the dataset and store them in the vector database. Data ingestion is an offline process, and it’s not needed in the real-time execution of the RAG pipeline. We only need it to add or update data in the database.
  • Data retrieval: When a user asks a query, we need to fetch the relevant documents from the database. To do this, the RAG pipeline first generates the vector embedding for the query. Next, it fetches similar documents from the database.
  • Combining user query and documents: After fetching relevant documents, the RAG pipeline combines them with the query to generate an input prompt for the LLM.
  • Output generation: In the final step of the RAG pipeline, the LLM uses the input prompt containing the query and the fetched documents to generate the response to the query. The response is then passed to the user through the user interface.
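The four steps above can be sketched end to end in plain Python. Everything here is illustrative: the embed() function is a fake letter-frequency "embedding", and the final LLM call is replaced by printing the prompt:

```python
import math

def embed(text):
    # Fake "embedding" for illustration: letter-frequency vector.
    # A real pipeline would call an embedding model here.
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return [text.lower().count(c) for c in alphabet]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# 1. Data ingestion (offline): embed the chunks and store them in an "index"
chunks = [
    "The capital of Bug Republic is Bugtown.",
    "Bug Republic's currency is the Bugcoin.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Retrieval: embed the query and pick the most similar stored chunk
query = "What currency does Bug Republic use?"
q_vec = embed(query)
best = max(index, key=lambda pair: cosine(q_vec, pair[1]))[0]

# 3. Augmentation: combine the query and retrieved context into one prompt
prompt = f"Answer using only this context:\n{best}\nQuestion: {query}"

# 4. Generation: this prompt would now be sent to an LLM
print(prompt)
```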

With this understanding of the components and working of a RAG pipeline, let’s build one using LangChain, ChromaDB, and Gemini.

Build a RAG pipeline using ChromaDB and LangChain in Python

To build the RAG pipeline, we will first install all the necessary Python modules and obtain the API keys required to access LLMs. Next, we will prepare and ingest the data into the ChromaDB vector database. Finally, we will implement the retrieval and generation stages of the RAG pipeline.

You can install all the modules required to build the RAG pipeline by executing the following command in the command-line terminal:

pip install chromadb langchain google-generativeai langchain-google-genai openai langchain-openai langchain-text-splitters langchain-chroma

We will be using Gemini or OpenAI models in our RAG pipeline. If you want to use Gemini models, you can create a Gemini API key. You can create an OpenAI API key if you want to use OpenAI models in your RAG pipeline.

After getting the API keys, you can set the environment variables in your Python file as follows:

import os
os.environ["GOOGLE_API_KEY"] = "Your_Gemini_API_Key"
os.environ["CHROMA_GOOGLE_GENAI_API_KEY"] = "Your_Gemini_API_Key"

If you are using ChatGPT models, you can set the environment variables using OpenAI API keys as follows:

os.environ["OPENAI_API_KEY"] = "Your_OpenAI_API_key"
os.environ["CHROMA_OPENAI_API_KEY"] = "Your_OpenAI_API_key"

After installing the necessary modules and setting up the environment variables, let’s preprocess the dataset.

Preprocess dataset

We will use the Bug Republic text file to build the RAG pipeline. Bug Republic is a fictional country, so no LLM can answer questions about it from its training data. Using the information in the text file, our RAG pipeline will answer questions about Bug Republic.

In the first step, we will split the text data into small chunks of 500 characters using the following rules:

  • First, we will split the text into paragraphs.
  • If a paragraph has more than 500 characters, we will split it into sentences.
  • If a sentence has more than 500 characters, we will split the sentence into text strings of 500 characters. To maintain the semantic meaning in the incomplete chunks, we will keep 50 overlapping characters between two chunks.
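The fixed-size-with-overlap rule in the last bullet can be illustrated with a minimal plain-Python splitter (a simplified stand-in that handles only the character-window case, not the paragraph and sentence rules):

```python
def split_with_overlap(text, chunk_size=500, overlap=50):
    # Slide a fixed-size window over the text, stepping by
    # chunk_size - overlap so consecutive chunks share `overlap` characters
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

demo_chunks = split_with_overlap("x" * 1200)
print([len(c) for c in demo_chunks])  # [500, 500, 300]
```

Each chunk begins with the last 50 characters of the previous one, which preserves some local context across the cut.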

To create the text chunks, we will use the RecursiveCharacterTextSplitter class from the langchain_text_splitters module as follows:

from langchain_text_splitters import RecursiveCharacterTextSplitter
# Read the data from the text file
with open("bug_republic.txt", "r") as f:
    text = f.read()
# Create a text splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", "."]
)
# Convert the input text into chunks
chunks = splitter.split_text(text)

It is essential to split the input text into small chunks because each embedding model has a fixed input token capacity. For instance, the gemini-embedding-001 model has a maximum input length of 2048 tokens. If a chunk exceeds this limit, the embedding model truncates the input while generating the vector embedding, and information is lost.
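As a rough pre-flight check before embedding, you can estimate token counts. The sketch below assumes the common heuristic of roughly four characters per token for English text; for exact counts you would use the model's own tokenizer:

```python
MAX_TOKENS = 2048      # e.g., the gemini-embedding-001 input limit
CHARS_PER_TOKEN = 4    # crude heuristic, not an exact tokenizer

def estimated_tokens(text):
    # Rough estimate: ~4 characters per token for English text
    return len(text) // CHARS_PER_TOKEN

def fits_embedding_limit(text, max_tokens=MAX_TOKENS):
    # Flag chunks that might be truncated by the embedding model
    return estimated_tokens(text) <= max_tokens

print(fits_embedding_limit("x" * 500))    # True: ~125 estimated tokens
print(fits_embedding_limit("x" * 10000))  # False: ~2500 estimated tokens
```

With 500-character chunks, we stay far below the limit, so truncation is not a concern in this pipeline.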

After splitting the input text into small chunks, let’s create their vector embeddings and ingest them into the ChromaDB vector database.

Insert data into a vector database

We will insert the text chunks into the ChromaDB vector database using the following steps.

  • First, we will initialize a persistent Chroma client using the PersistentClient() function from the chromadb module. The PersistentClient() function takes the filepath of the vector database as its input and returns a client object.
  • Next, we will initialize an embedding function using the GoogleGenerativeAiEmbeddingFunction() function from the chromadb.utils.embedding_functions module. The GoogleGenerativeAiEmbeddingFunction() function takes the embedding model name we want to use as its input and returns an embedding function. If you want to use OpenAI embedding models, you can use the OpenAIEmbeddingFunction() function to initialize the embedding function.
  • After initializing the embedding function, we will create a ChromaDB collection named bug_republic to store the data. To do this, we will use the create_collection() method, which takes the collection name, the embedding function, and a dictionary containing other metadata as its input.
  • Finally, we will create an ID for each text chunk in the input data and insert them into the vector database using the add() method. The add() method takes a list of IDs for the input text chunks as input to its ids parameter and the list of text chunks as input to the documents parameter.

After executing the add() method, the text data is inserted into the bug_republic collection in the ChromaDB vector database.

import chromadb
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction
# Initialize a persistent Chroma client
chroma_client = chromadb.PersistentClient(path="/home/aditya1117/codes/codecademy_resources/chromadb")
# Initialize an embedding function
embedding_fn = GoogleGenerativeAiEmbeddingFunction(model_name="models/gemini-embedding-001")
# Use the following code to initialize an embedding function using OpenAI models
# from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
# embedding_fn = OpenAIEmbeddingFunction(model_name="text-embedding-3-small")
# Create a ChromaDB collection
collection = chroma_client.create_collection(
    name="bug_republic",
    metadata={"description": "Collection for storing information about Bug Republic."},
    embedding_function=embedding_fn
)
# Create document IDs
num_chunks = len(chunks)
ids = ["doc_" + str(i) for i in range(num_chunks)]
# Add the documents to the collection
collection.add(
    ids=ids,
    documents=chunks
)

After ingesting the data into the ChromaDB vector database, we can use it to implement the retrieval and generation stages for the RAG pipeline.

Build a retriever

We will use the bug_republic collection in the ChromaDB vector database to build a retriever in the RAG pipeline that fetches documents relevant to a given query:

  • First, we will create a client to the ChromaDB vector database using the PersistentClient() function.
  • Next, we will initialize an embedding function using the GoogleGenerativeAIEmbeddings() function from the langchain_google_genai module. Here, we must use the same embedding model we used while ingesting data into the ChromaDB vector database.
  • Next, we will build a LangChain vector store using the Chroma client and the embedding function by passing the client, embedding function, and the ChromaDB collection name as input to the Chroma() function from the langchain_chroma module.
  • Finally, we will build the retriever using the as_retriever() method of the vector store. The as_retriever() method accepts a dictionary of search arguments through its search_kwargs parameter. We will set k=4, which determines how many relevant documents the retriever fetches for a given query.

We get a retriever after executing the as_retriever() method, as shown in the following code:

import chromadb
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
# Create a Chroma client
chroma_client = chromadb.PersistentClient(path="/home/aditya1117/codes/codecademy_resources/chromadb")
# Initialize an embedding function
embedding_fn = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")
# Use the following code to initialize the embedding function if you are using OpenAI models
# from langchain_openai import OpenAIEmbeddings
# embedding_fn = OpenAIEmbeddings(model="text-embedding-3-small")
# Create a vector store
vectorstore = Chroma(
    client=chroma_client,
    collection_name="bug_republic",
    embedding_function=embedding_fn
)
# Build a retriever that fetches the top 4 matching documents
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

After getting the retriever, we will create a document combination chain to combine the user query and the retrieved documents.

Create a document chain to combine the query and relevant documents

We will use the following steps to combine the user query and the documents fetched from the vector database.

  • First, we will initialize an LLM using the ChatGoogleGenerativeAI() function from the langchain_google_genai module. You can use the ChatOpenAI() function from the langchain_openai module if you want to use ChatGPT models.
  • Next, we will define a prompt template with placeholders for the user query and the fetched documents. To do this, we will use the ChatPromptTemplate() function from the langchain.prompts module.
  • After creating the prompt template, we will create a document combination chain using the create_stuff_documents_chain() function. This function takes the prompt template and the LLM object as its input. During the RAG pipeline’s execution, the document combination chain combines the fetched documents and the user query into the prompt template and passes it to the LLM.
  • Finally, we will pass the retriever and the document combination chain to the create_retrieval_chain() function to create the complete retrieval chain.

Using these steps, you can create the prompt template, document combination chain, and the retrieval chain as shown in the following code:

from langchain.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_google_genai import ChatGoogleGenerativeAI
# Create an LLM object
llm = ChatGoogleGenerativeAI(model="models/gemini-2.5-flash")
# Use the following code if you are using OpenAI models
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-4o-mini")
# Define a prompt template
prompt_template = ChatPromptTemplate.from_template("""
Answer the question based only on the following context:
{context}
Question: {input}
""")
# Create document combination chain using LLM and prompt template
document_chain = create_stuff_documents_chain(llm, prompt_template)
# Create RAG chain using retriever and document chain
rag_chain = create_retrieval_chain(retriever, document_chain)

After creating the RAG chain, we can use it to answer user queries based on the documents in the vector database.

Generate output using the RAG pipeline

We can use the invoke() method of the RAG chain to generate outputs for user queries as follows:

query = "What is the national currency of Bug Republic?"
response = rag_chain.invoke({"input": query})
query_answer = response.get("answer")
print("The query is:")
print(query)
print("The response is:")
print(query_answer)

Output:

The query is:
What is the national currency of Bug Republic?
The response is:
The national currency of Bug Republic is the **Bugcoin (BGC)**.

As you can see, the RAG pipeline gives the correct output using the information in the text file that we ingested into the ChromaDB vector database. Thus, we have successfully built the RAG pipeline to answer queries using the input text data. Now, let’s look at some real-world applications of RAG pipelines.

Real-world use cases for RAG pipelines

RAG pipelines are useful in AI applications that need up-to-date information, facts, and proprietary data. The following are some of the use cases where RAG pipelines can be helpful:

  • Customer support: Building a customer support application requires fetching real-time updates for support tickets and orders and answering user queries based on the order or ticket status. Using RAG pipelines, we can create a customer support application that generates accurate and context-aware responses to maximize customer satisfaction.
  • Healthcare: We can build RAG pipelines using LLMs and medical literature to help healthcare professionals get evidence-based recommendations on patient symptoms, drug interactions, and treatment procedures. This can help them make informed decisions without relying on their memory.
  • Legal document analysis: Like healthcare, we can also build RAG pipelines to provide recommendations for legal issues based on past cases and government policies. This will help law professionals save time in legal research and reduce the risk of overlooking critical information.
  • Education: We can use RAG pipelines to build personal tutors that teach students and resolve their queries based on their performance, learning speed, and textbook content. This can help students learn at their own pace.
  • Enterprise knowledge management: We can build RAG pipelines using wikis, company policies, Confluence pages, product specifications, etc., to help employees easily get answers to queries about proprietary products and company policies. This will help improve productivity by making organizational knowledge easily accessible.

Conclusion

RAG pipelines combine the strengths of retrieval and generation, making AI applications more accurate and context-aware. In this article, we discussed the basics of RAG pipelines, their components, and how they work. We then implemented a RAG pipeline using LangChain, ChromaDB, and Gemini.

To learn more about RAG pipelines, you can go through this course on Creating AI Applications using Retrieval-Augmented Generation (RAG) that discusses how to build a RAG application with Streamlit and ChromaDB. You might also like this finetuning transformer models course that discusses LLM finetuning with LoRA, QLoRA, and Hugging Face.

Frequently asked questions

1. What is the difference between RAG and LLM?

An LLM is a foundation model trained on web-scale datasets to generate text; multimodal variants can also handle audio, images, and video. However, an LLM is limited to its pre-trained knowledge. RAG augments an LLM by retrieving real-time, query-relevant information and using it as context when generating outputs.

2. Can LLM work without RAG?

Yes. An LLM is trained on web-scale historical data and works without RAG. However, RAG helps us integrate external information into AI applications requiring task-specific, domain-specific, or proprietary datasets.

3. Do you need a vector database for RAG?

No. You can build RAG pipelines even without a vector database. In that case, you can use keyword-based search methods to fetch documents for a given query. However, vector databases make the document search and retrieval process very quick. Hence, you should build RAG pipelines with vector databases if possible.
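The keyword-based alternative mentioned above can be sketched in a few lines: score each document by how many words it shares with the query and return the best match:

```python
def keyword_score(query, document):
    # Count how many distinct words the query and document share
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words)

documents = [
    "The national currency of Bug Republic is the Bugcoin.",
    "Bug Republic celebrates its independence day in March.",
]
query = "What is the currency of Bug Republic?"
best = max(documents, key=lambda d: keyword_score(query, d))
print(best)  # "The national currency of Bug Republic is the Bugcoin."
```

Unlike embedding-based retrieval, this only matches exact words, so it misses paraphrases and synonyms; production keyword search would use a ranking function like BM25 instead of raw overlap.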

4. What are the alternatives to RAG?

To build AI applications using task-specific or domain-specific datasets, you can use knowledge augmented generation (KAG), cache augmented generation (CAG), or LLM fine-tuning instead of RAG.

5. What is the difference between RAG and CAG?

RAG pipelines dynamically retrieve external data for each query before generating output. CAG (cache-augmented generation) instead pre-loads frequently used documents into the LLM’s extended context window (the cache), avoiding repeated retrieval and enabling faster responses.

6. What are the steps in a RAG pipeline?

A RAG pipeline has three main steps:

  1. Data ingestion - Preprocessing documents and storing embeddings in a vector database.
  2. Retrieval - Finding relevant documents for user queries using similarity search.
  3. Generation - Combining retrieved documents with the query to generate responses using an LLM.
Codecademy Team

The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.


Learn more on Codecademy

  • Learn how to give your large language model the powers of retrieval with RAG, and build a RAG app with Streamlit and ChromaDB.
    • With Certificate
    • Intermediate.
      3 hours
  • Learn about effective prompting techniques to craft high-quality prompts, maximizing your use of generative AI.
    • With Certificate
    • Beginner Friendly.
      1 hour
  • Learn to apply continuous principles in the DevOps CI/CD pipeline, including stages, automation, containerization, SRE, and operational efficiency.
    • Beginner Friendly.
      1 hour