How to Build RAG Pipelines in AI Applications
Retrieval-augmented generation (RAG) is an AI technique for generating accurate responses to user queries. It retrieves documents relevant to the query from an external database and passes them as context to a large language model (LLM), which then generates the output.
This article will help you understand the basics of RAG pipelines, their components, and how they work by building a RAG pipeline using an external dataset, a vector database, and an LLM.
What is a RAG pipeline?
A RAG pipeline is the structured workflow that combines external data retrieval with an LLM’s text generation capabilities to generate context-aware responses to a given query. A RAG pipeline consists of three stages: data ingestion, retrieval, and generation.
A RAG pipeline has multiple components that perform specific tasks. Let’s discuss the different components of a RAG pipeline.
Components of a RAG pipeline
A RAG pipeline consists of three stages:
- Data ingestion
- Retrieval
- Generation
Each stage of the RAG pipeline has different components, such as a vector database, an embedding model, an LLM, a prompt template, etc. Let’s explore the different components in the various stages of a RAG pipeline.
Data ingestion stage
In the data ingestion stage, we have a proprietary dataset, an embedding model, and a vector database.
- Dataset: The dataset is the task-specific or domain-specific proprietary dataset that contains information to be used by the RAG pipeline while generating query responses. The documents in the dataset are split into small chunks, like paragraphs or sentences, and sent to the embedding model.
- Embedding model: The embedding model generates vector embeddings for the text chunks in the dataset. Some popular embedding models include Google’s `gemini-embedding-001` and OpenAI’s `text-embedding-3-small`.
- Vector embeddings: Vector embeddings generated by the embedding model are numerical representations of text chunks. Each vector embedding represents the semantic meaning of the text from which it was generated.
- Vector database: The vector database stores and queries data using vector embeddings. We store the input data with vector embeddings and other metadata in the vector database.
The vector embeddings stored in the vector database are fetched by a retriever using the vector embedding of the user query.
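The similarity search behind this lookup can be illustrated with plain Python. The vectors and document IDs below are toy values for illustration only; real embeddings have hundreds or thousands of dimensions, and the vector database handles this comparison internally:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real models produce far more dimensions
query_vec = [0.9, 0.1, 0.3]
doc_vecs = {
    "doc_0": [0.8, 0.2, 0.4],   # close in direction to the query vector
    "doc_1": [-0.5, 0.9, 0.1],  # points in a different direction
}

# Rank stored documents by similarity to the query embedding
ranked = sorted(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]), reverse=True)
print(ranked)  # ['doc_0', 'doc_1']
```

Documents whose embeddings point in nearly the same direction as the query embedding score close to 1 and are returned first.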
Retrieval stage
When a user asks a query, the retrieval stage in the RAG pipeline fetches relevant documents for the query from the vector database. It has the following components:
- Query: The query is the input the user gives to the RAG pipeline to get an answer.
- Embedding model: The embedding model used in the retrieval stage is the same as that used in the ingestion stage. It converts the user query into a vector embedding.
- Document retriever: A document retriever fetches relevant documents for the query from the vector database.
Once we have the relevant documents for the query, the generation stage uses the query and the documents to generate output.
Generation stage
In the generation stage of the RAG pipeline, the user query and relevant documents are combined using a prompt template. The combined prompt is then passed to the LLM to generate the output.
- Prompt template: The prompt template combines the user query and the documents fetched by the retriever to create the final prompt for response generation.
- LLM: The LLM can be self-hosted or API-accessible, like Gemini or OpenAI models. In the final stage of the RAG pipeline, the LLM generates output for the user query using the prompt.
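To make the prompt template’s role concrete, here is a plain-Python sketch. The template text and documents below are illustrative stand-ins; later in this article, the real template is built with LangChain:

```python
# Illustrative prompt template; {context} and {input} are filled in at query time
PROMPT_TEMPLATE = """Answer the question based only on the following context:
{context}

Question: {input}"""

# Stand-ins for documents fetched by the retriever
retrieved_docs = [
    "Bug Republic's national currency is the Bugcoin.",
    "Bugcoin is abbreviated as BGC.",
]
query = "What is the national currency of Bug Republic?"

# Combine the fetched documents and the user query into the final prompt
prompt = PROMPT_TEMPLATE.format(context="\n".join(retrieved_docs), input=query)
print(prompt)
```

The resulting string, containing both the retrieved context and the question, is what gets sent to the LLM.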
With this introduction to the different components of a RAG pipeline, let’s discuss how it works.
How does a RAG pipeline work?
A RAG pipeline has several steps, including data ingestion, document retrieval, query and document augmentation, and response generation.
- Data ingestion: We first preprocess our dataset and split it into small chunks, such as paragraphs or sentences, based on their size. Then, we generate embeddings for each chunk in the dataset and store them in the vector database. Data ingestion is an offline process, and it’s not needed in the real-time execution of the RAG pipeline. We only need it to add or update data in the database.
- Data retrieval: When a user asks a query, we need to fetch the relevant documents from the database. To do this, the RAG pipeline first generates the vector embedding for the query. Next, it fetches similar documents from the database.
- Combining user query and documents: After fetching relevant documents, the RAG pipeline combines them with the query to generate an input prompt for the LLM.
- Output generation: In the final step of the RAG pipeline, the LLM uses the input prompt containing the query and the fetched documents to generate the response to the query. The response is then passed to the user through the user interface.
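The four steps above can be sketched end to end in a few lines. This is a toy illustration only: `retrieve()` uses word overlap as a stand-in for vector similarity search, and `generate()` is a placeholder for a real LLM call:

```python
# Toy corpus standing in for an ingested dataset
documents = [
    "The capital of Bug Republic is Debugville.",
    "The national currency of Bug Republic is the Bugcoin.",
]

def retrieve(query, docs, k=1):
    # Stand-in for vector similarity search: score docs by shared words
    q_words = set(query.lower().replace("?", "").split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().rstrip(".").split())), reverse=True)
    return scored[:k]

def generate(prompt):
    # Placeholder for an LLM call; a real pipeline sends the prompt to a model
    return f"[LLM response to: {prompt[:50]}...]"

query = "What is the capital of Bug Republic?"
context = retrieve(query, documents)                # retrieval
prompt = f"Context: {context}\nQuestion: {query}"   # augmentation
answer = generate(prompt)                           # generation
print(answer)
```

The real pipeline built in the rest of this article replaces each stand-in with a production component: a vector database for `retrieve()` and an actual LLM for `generate()`.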
With this understanding of the components and working of a RAG pipeline, let’s build one using LangChain, ChromaDB, and Gemini.
Build a RAG pipeline using ChromaDB and LangChain in Python
To build the RAG pipeline, we will first install all the necessary Python modules and obtain the API keys required to access LLMs. Next, we will prepare and ingest the data into the ChromaDB vector database. Finally, we will implement the retrieval and generation stages of the RAG pipeline.
You can install all the modules required to build the RAG pipeline by executing the following command in the command-line terminal:
```shell
pip install chromadb langchain google-generativeai langchain-google-genai openai langchain-openai langchain-text-splitters langchain-chroma
```
We will use either Gemini or OpenAI models in our RAG pipeline. To use Gemini models, create a Gemini API key; to use OpenAI models, create an OpenAI API key.
After getting the API keys, you can set the environment variables in your Python file as follows:
```python
import os

os.environ['GOOGLE_API_KEY'] = "Your_Gemini_API_Key"
os.environ['CHROMA_GOOGLE_GENAI_API_KEY'] = "Your_Gemini_API_Key"
```
If you are using ChatGPT models, you can set the environment variables using OpenAI API keys as follows:
```python
os.environ["OPENAI_API_KEY"] = "Your_OpenAI_API_key"
os.environ['CHROMA_OPENAI_API_KEY'] = "Your_OpenAI_API_key"
```
After installing the necessary modules and setting up the environment variables, let’s preprocess the dataset.
Preprocess dataset
We will use the Bug Republic text file to build the RAG pipeline. Bug Republic is a fictional country whose data isn’t publicly available, so an LLM on its own cannot answer questions about it. Using the information in the text file, the RAG pipeline will answer questions related to Bug Republic.
In the first step, we will split the text data into small chunks of 500 characters using the following rules:
- First, we will split the text into paragraphs.
- If a paragraph has more than 500 characters, we will split it into sentences.
- If a sentence has more than 500 characters, we will split the sentence into text strings of 500 characters. To maintain the semantic meaning in the incomplete chunks, we will keep 50 overlapping characters between two chunks.
To create the text chunks, we will use the `RecursiveCharacterTextSplitter` class from the `langchain.text_splitter` module as follows:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Read the data from the text file
text = open("bug_republic.txt", "r").read()

# Create a text splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", "."]
)

# Convert the input text into chunks
chunks = splitter.split_text(text)
```
It is essential to convert the input text into small chunks as each embedding model has a fixed input token capacity. For instance, the gemini-embedding-001 model has a maximum input token length of 2048. Providing a text chunk longer than the maximum token length to the embedding model truncates the input text while generating vector embeddings, and we lose information.
After splitting the input text into small chunks, let’s create their vector embeddings and ingest them into the ChromaDB vector database.
Insert data into a vector database
We will insert the text chunks into the ChromaDB vector database using the following steps.
- First, we will initialize a persistent Chroma client using the `PersistentClient()` function from the `chromadb` module. The `PersistentClient()` function takes the filepath of the vector database as its input and returns a client object.
- Next, we will initialize an embedding function using the `GoogleGenerativeAiEmbeddingFunction()` function from the `chromadb.utils.embedding_functions` module. It takes the name of the embedding model we want to use as its input and returns an embedding function. If you want to use OpenAI embedding models, you can use the `OpenAIEmbeddingFunction()` function instead.
- After initializing the embedding function, we will create a ChromaDB collection named `bug_republic` to store the data. To do this, we will use the `create_collection()` method, which takes the collection name, the embedding function, and a dictionary containing other metadata as its input.
- Finally, we will create an ID for each text chunk in the input data and insert the chunks into the vector database using the `add()` method. The `add()` method takes a list of IDs for the input text chunks as input to its `ids` parameter and the list of text chunks as input to its `documents` parameter.
After executing the add() method, the text data is inserted into the bug_republic collection in the ChromaDB vector database.
```python
import chromadb
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction

# Initialize Chroma client
chroma_client = chromadb.PersistentClient(path="/home/aditya1117/codes/codecademy_resources/chromadb")

# Initialize an embedding function
embedding_fn = GoogleGenerativeAiEmbeddingFunction(model_name="models/gemini-embedding-001")

# Use the following code to initialize an embedding function using OpenAI models
# from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
# embedding_fn = OpenAIEmbeddingFunction(model_name="text-embedding-3-small")

# Create a ChromaDB collection
collection = chroma_client.create_collection(
    name="bug_republic",
    metadata={"description": "Collection for storing information about Bug Republic."},
    embedding_function=embedding_fn
)

# Create document IDs
num_chunks = len(chunks)
ids = ["doc_" + str(i) for i in range(num_chunks)]

# Add documents to the collection
collection.add(ids=ids, documents=chunks)
```
After ingesting the data into the ChromaDB vector database, we can use it to implement the retrieval and generation stages for the RAG pipeline.
Build a retriever
We will use the bug_republic collection in the ChromaDB vector database to build a retriever in the RAG pipeline that fetches documents relevant to a given query:
- First, we will create a client for the ChromaDB vector database using the `PersistentClient()` function.
- Next, we will initialize an embedding function using the `GoogleGenerativeAIEmbeddings()` function from the `langchain_google_genai` module. Here, we must use the same embedding model we used while ingesting data into the ChromaDB vector database.
- Next, we will build a LangChain vector store by passing the Chroma client, the embedding function, and the ChromaDB collection name as input to the `Chroma()` function from the `langchain_chroma` module.
- Finally, we will build the retriever using the `as_retriever()` method of the vector store. The `as_retriever()` method takes a dictionary of arguments as input to its `search_kwargs` parameter. We will set `k=4`, which decides how many relevant documents to fetch for a given query.
We get a retriever after executing the as_retriever() method, as shown in the following code:
```python
import chromadb
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Create a Chroma client
chroma_client = chromadb.PersistentClient(path="/home/aditya1117/codes/codecademy_resources/chromadb")

# Initialize an embedding function
embedding_fn = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")

# Use the following code to initialize the embedding function if you are using OpenAI models
# from langchain_openai import OpenAIEmbeddings
# embedding_fn = OpenAIEmbeddings(model="text-embedding-3-small")

# Create a vector store
vectorstore = Chroma(
    client=chroma_client,
    collection_name="bug_republic",
    embedding_function=embedding_fn
)

# Build the retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
```
After getting the retriever, we will create a document combination chain to combine the user query and the retrieved documents.
Create a document chain to combine the query and relevant documents
We will use the following steps to combine the user query and the documents fetched from the vector database.
- First, we will initialize an LLM using the `ChatGoogleGenerativeAI()` function from the `langchain_google_genai` module. You can use the `ChatOpenAI()` function from the `langchain_openai` module if you want to use ChatGPT models.
- Next, we will define a prompt template with placeholders for the user query and the fetched documents. To do this, we will use the `ChatPromptTemplate` class from the `langchain.prompts` module.
- After creating the prompt template, we will create a document combination chain using the `create_stuff_documents_chain()` function, which takes the prompt template and the LLM object as its input. During the RAG pipeline’s execution, the document combination chain combines the fetched documents and the user query into the prompt template and passes it to the LLM.
- Finally, we will pass the retriever and the document combination chain to the `create_retrieval_chain()` function to create the complete retrieval chain.
Using these steps, you can create the prompt template, document combination chain, and the retrieval chain as shown in the following code:
```python
from langchain.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_google_genai import ChatGoogleGenerativeAI

# Create an LLM object
llm = ChatGoogleGenerativeAI(model="models/gemini-2.5-flash")

# Use the following code if you are using OpenAI models
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-4o-mini")

# Define a prompt template
prompt_template = ChatPromptTemplate.from_template(
    """Answer the question based only on the following context:
{context}
Question: {input}"""
)

# Create a document combination chain using the LLM and prompt template
document_chain = create_stuff_documents_chain(llm, prompt_template)

# Create the RAG chain using the retriever and document chain
rag_chain = create_retrieval_chain(retriever, document_chain)
```
After creating the RAG chain, we can use it to answer user queries based on the documents in the vector database.
Generate output using the RAG pipeline
We can use the invoke() method of the RAG chain to generate outputs for user queries as follows:
```python
query = "What is the national currency of Bug Republic?"
response = rag_chain.invoke({"input": query})
query_answer = response.get("answer")

print("The query is:")
print(query)
print("The response is:")
print(query_answer)
```
Output:
```
The query is:
What is the national currency of Bug Republic?
The response is:
The national currency of Bug Republic is the **Bugcoin (BGC)**.
```
As you can see, the RAG pipeline gives the correct output using the information in the text file that we ingested into the ChromaDB vector database. Thus, we have successfully built the RAG pipeline to answer queries using the input text data. Now, let’s look at some real-world applications of RAG pipelines.
Real-world use cases for RAG pipelines
RAG pipelines are useful in AI applications that need up-to-date information, facts, and proprietary data. The following are some of the use cases where RAG pipelines can be helpful:
- Customer support: Building a customer support application requires fetching real-time updates for support tickets and orders and answering user queries based on the order or ticket status. Using RAG pipelines, we can create a customer support application that generates accurate and context-aware responses to maximize customer satisfaction.
- Healthcare: We can build RAG pipelines using LLMs and medical literature to help healthcare professionals get evidence-based recommendations on patient symptoms, drug interactions, and treatment procedures. This can help them make informed decisions without relying on their memory.
- Legal document analysis: Like healthcare, we can also build RAG pipelines to provide recommendations for legal issues based on past cases and government policies. This will help law professionals save time in legal research and reduce the risk of overlooking critical information.
- Education: We can use RAG pipelines to build personal tutors that teach and resolve students’ queries while considering their performance, learning speed, and textbook content. This can help students learn at their own pace.
- Enterprise knowledge management: We can build RAG pipelines using wikis, company policies, Confluence pages, product specifications, etc., to help employees easily get answers to queries about proprietary products and company policies. This will help improve productivity by making organizational knowledge easily accessible.
Conclusion
RAG pipelines combine the strengths of retrieval and generation, making AI applications more accurate and context-aware. In this article, we discussed the basics of RAG pipelines, their components, and how they work. We then implemented a RAG pipeline using LangChain, ChromaDB, and Gemini.
To learn more about RAG pipelines, you can go through this course on Creating AI Applications using Retrieval-Augmented Generation (RAG) that discusses how to build a RAG application with Streamlit and ChromaDB. You might also like this finetuning transformer models course that discusses LLM finetuning with LoRA, QLoRA, and Hugging Face.
Frequently asked questions
1. What is the difference between RAG and LLM?
An LLM is a foundation model trained on web-scale datasets that generates outputs from its pre-trained knowledge. However, LLMs are limited to that knowledge. RAG enables us to build AI applications that retrieve information relevant to a given query in real time and use it as context for the LLM to generate outputs.
2. Can LLM work without RAG?
Yes. An LLM is trained on web-scale historical data and works without RAG. However, RAG helps us integrate external information into AI applications requiring task-specific, domain-specific, or proprietary datasets.
3. Do you need a vector database for RAG?
No. You can build RAG pipelines even without a vector database. In that case, you can use keyword-based search methods to fetch documents for a given query. However, vector databases make the document search and retrieval process very quick. Hence, you should build RAG pipelines with vector databases if possible.
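A minimal keyword-based retriever can be sketched with an inverted index. The documents and the scoring rule below are illustrative; production systems typically use ranking functions such as BM25:

```python
from collections import defaultdict

# Toy document store
docs = {
    "doc_0": "Bug Republic uses the Bugcoin as its national currency",
    "doc_1": "The capital of Bug Republic is Debugville",
}

# Build an inverted index: each keyword maps to the IDs of documents containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def keyword_search(query):
    # Rank document IDs by how many query keywords they contain
    counts = defaultdict(int)
    for word in query.lower().split():
        for doc_id in index.get(word, ()):
            counts[doc_id] += 1
    return sorted(counts, key=counts.get, reverse=True)

print(keyword_search("national currency bugcoin"))  # ['doc_0']
```

Unlike vector search, this only matches exact words, so queries phrased with synonyms ("money" instead of "currency") would miss relevant documents.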
4. What are the alternatives to RAG?
To build AI applications using task-specific or domain-specific datasets, you can use knowledge augmented generation (KAG), cache augmented generation (CAG), or LLM fine-tuning instead of RAG.
5. What is the difference between RAG and CAG?
RAG pipelines dynamically retrieve external data for each query to generate output. CAG (Cache Augmented Generation) avoids frequent reads by pre-loading frequently used documents into the LLM’s extended context window (cache) for faster responses.
6. What are the steps in a RAG pipeline?
A RAG pipeline has three main steps:
- Data ingestion - Preprocessing documents and storing embeddings in a vector database.
- Retrieval - Finding relevant documents for user queries using similarity search.
- Generation - Combining retrieved documents with the query to generate responses using an LLM.
The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.