What is Retrieval-Augmented Generation (RAG) in AI?
Retrieval-augmented generation (RAG) is the process of retrieving documents relevant to a user query, augmenting the query with those documents, and passing the combined input to a large language model (LLM) to generate accurate outputs.
Using RAG, you can build powerful domain-specific AI applications by providing LLMs with query-specific information from external databases.
In this guide, you’ll learn how RAG works and build a complete RAG application using LangChain and ChromaDB.
How does RAG work?
RAG works in three steps: retrieval, augmentation, and generation. Before that, we need to ingest data into a vector database from which the RAG application can retrieve information to answer queries. The following diagram shows how a RAG application works:
The different components in the RAG application work as follows:
- Data ingestion: We first process a proprietary dataset and split it into small chunks. Then, we create vector embeddings of the text chunks using an embedding model and store them in a vector database like ChromaDB, Pinecone, or Weaviate. This is a one-time, offline process that needs to be repeated only when we want to add to or update the knowledge base of the RAG application.
- Retrieval: Whenever a user asks a query from the RAG application, we convert the query into a vector embedding and fetch relevant documents for the query from the vector database using a retriever.
- Augmentation: After we have the query and the relevant documents, we combine them into a single input using a prompt template.
- Generation: We pass the combined text and query to an LLM that generates a response to the query using the fetched documents.
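The four steps above can be sketched in plain Python. This is a toy illustration, not a real implementation: the retrieve() and augment() helpers are hypothetical names, and keyword overlap stands in for the vector similarity search and LLM call a real RAG application would use.

```python
# Toy sketch of the retrieval-augmentation-generation loop. The retrieve()
# and augment() helpers are hypothetical names; keyword overlap stands in
# for the vector similarity search used by a real RAG application.

def retrieve(query, documents, k=1):
    # Rank documents by naive keyword overlap with the query and return top k.
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def augment(query, context_docs):
    # Combine the retrieved documents and the query via a prompt template.
    context = "\n".join(context_docs)
    return f"Answer based only on this context:\n{context}\nQuestion: {query}"

docs = [
    "The USC population is around 150 million registered learners.",
    "The official languages of the USC are Python, JavaScript, and SQL.",
]
query = "What is the population of the USC?"
prompt = augment(query, retrieve(query, docs))
print(prompt)
# A real application would now pass `prompt` to an LLM for generation.
```

Even in this sketch, the generation step only sees the context the retriever selected, which is why retrieval quality matters so much in real RAG systems.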
Using these steps, we can implement RAG applications using any vector database, a proprietary dataset, an embedding model, and an LLM. We will implement a RAG application using LangChain, ChromaDB, and the Gemini API or the OpenAI API. Let’s first install the required modules and set up environment variables.
RAG application setup and installation
We will use the ChromaDB vector database to store the data, Gemini or OpenAI to create embeddings and access an LLM, and LangChain to build the RAG application. You can install the required modules for all these frameworks using pip as follows:
pip install chromadb langchain google-generativeai langchain-google-genai openai langchain-openai langchain-text-splitters langchain-chroma
If you want to use Gemini, you can create a Gemini API key and set the environment variables as follows:
import os

os.environ['GOOGLE_API_KEY'] = "Your_Gemini_API_Key"
os.environ['CHROMA_GOOGLE_GENAI_API_KEY'] = "Your_Gemini_API_Key"
The GOOGLE_API_KEY API key is required to use Gemini models inside LangChain. The CHROMA_GOOGLE_GENAI_API_KEY API key is needed to generate embeddings while ingesting data into ChromaDB.
If you are using OpenAI models, you can create an OpenAI API key and set the environment variables as follows:
import os

os.environ["OPENAI_API_KEY"] = "your_OpenAI_api_key"
os.environ["CHROMA_OPENAI_API_KEY"] = "your_OpenAI_api_key"
After installation and setup, let’s discuss building a RAG application using LangChain, ChromaDB, and Gemini in Python.
Building a RAG application using LangChain in Python
Suppose we want to build a RAG application that answers queries about a fictional country, the "United States of Codecademy (USC)". Since there is no information about this country online, no LLM can answer queries related to it. For example, let's ask Gemini for the population of the USC.
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="models/gemini-2.0-flash-lite")

# Use the following code to use ChatGPT models
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-4o-mini")

query = "What is the population of the United States of Codecademy?"
response = llm.invoke(query)

print("The query is:")
print(query)
print("The LLM response is:")
print(response.content)
Output:
The query is:
What is the population of the United States of Codecademy?
The LLM response is:
There is no official population for a "United States of Codecademy." Codecademy is an online learning platform, not a country or a physical location.
As you can see, Gemini has no information about the USC's population. In this case, we can build a RAG application that provides all the information about the USC to the LLM, which can then handle such queries. Let's walk through this step by step.
Step 1: Prepare dataset
We will use the USC details text file to get the text containing the information related to the United States of Codecademy.
Next, we need to create text chunks from the dataset to generate embeddings for each chunk and store them in a vector database. Chunking is important as each embedding generation model has a fixed input token capacity. For example, the gemini-embedding-001 model we will use to create vector embeddings has a maximum input token length of 2048. If we provide text longer than the maximum token length to the embedding model, the text is truncated, and we lose information. So, we will divide the entire text into meaningful chunks. For this, we will use the following strategy:
- First, we will divide the text into paragraphs with lengths less than 500 characters (around 100 words).
- If a paragraph has more than 500 characters, we will split it into sentences.
- If a sentence has more than 500 characters, we will split the text into 500-character chunks. However, these chunks may cut words or phrases mid-way. So, we will keep a 50-character overlap (ten percent of the chunk size) between consecutive chunks so that chunks from the same sentence maintain their semantic continuity.
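The fixed-size fallback with overlap can be sketched in a few lines of plain Python. This is only an illustration of the overlap idea (with a tiny chunk size of 10 and overlap of 3 for readability); the actual implementation below uses RecursiveCharacterTextSplitter, which first tries paragraph and sentence separators before falling back to fixed-size windows like this.

```python
# Illustrative fixed-size chunking with overlap, using chunk_size=10 and
# chunk_overlap=3 for readability (the guide uses 500 and 50).

def chunk_with_overlap(text, chunk_size, chunk_overlap):
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_with_overlap("abcdefghijklmnop", chunk_size=10, chunk_overlap=3)
print(chunks)  # consecutive chunks share their last/first 3 characters
```

Notice that each chunk repeats the last few characters of the previous one, which is exactly how the 50-character overlap preserves continuity across 500-character chunks.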
To create chunks from the input text, we will use the RecursiveCharacterTextSplitter class defined in the langchain.text_splitter module as follows:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = open("United_States_of_Codecademy.txt", "r").read()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", "."]
)
chunks = splitter.split_text(text)
After creating text chunks from the input data, we will ingest the data into the ChromaDB vector database.
Step 2: Ingest data into a vector database
To store the data in the ChromaDB vector database, we will first create a Chroma client using the PersistentClient() constructor defined in the chromadb module. PersistentClient() takes the directory where we want to keep the ChromaDB database as input to its path parameter and returns a client object.
import chromadb

chroma_client = chromadb.PersistentClient(path="/home/aditya1117/codes/codecademy_resources/chromadb")
After creating the ChromaDB client, we will create a collection to store the embeddings for the text data related to the USC. To create the embeddings, we will define an embedding function using the GoogleGenerativeAiEmbeddingFunction class as follows:
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction

embedding_fn = GoogleGenerativeAiEmbeddingFunction(model_name="models/gemini-embedding-001")
If you are using the OpenAI API, you can initialize the embedding function as follows:
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

embedding_fn = OpenAIEmbeddingFunction(model_name="text-embedding-3-small")
Next, we will create a collection named united_states_of_codecademy using the create_collection() method. The create_collection() method takes the collection name, a metadata dictionary containing its description, and the embedding function as input arguments, and creates the collection in the database. After execution, it returns a collection object that we can use to perform read and write operations on the database collection.
collection = chroma_client.create_collection(
    name="united_states_of_codecademy",
    metadata={"description": "Collection for storing information about the United States of Codecademy"},
    embedding_function=embedding_fn
)
After creating the collection, we will add the text to the united_states_of_codecademy collection using the add() method. The add() method takes a list of IDs for the input text chunks as input to its ids parameter and a list of text chunks as input to the documents parameter.
We will create a list of IDs for the input chunks and insert them into the ChromaDB collection as follows:
# Create document IDs
num_chunks = len(chunks)
ids = ["chunk" + str(i) for i in range(num_chunks)]

# Add documents to the collection
collection.add(ids=ids, documents=chunks)
After adding the text to the ChromaDB collection, you can have a look at the first few documents in the collection using the peek() method:
print("The first three documents in the collection are:")
print(collection.peek(3))
Output:
The first three documents in the collection are:
{'ids': ['chunk0', 'chunk1', 'chunk2'],
'embeddings': array([[ 0.02023244, 0.01668288, 0.00134622, ..., -0.01057068, -0.00448969, -0.00029907],
[ 0.01078238, 0.0278141 , 0.00588628, ..., -0.00812627, 0.01997375, 0.00649342],
[ 0.0178372 , 0.01736952, 0.00395358, ..., -0.02575691, 0.00984182, 0.00612617]]),
'documents': ['# United States of Codecademy\n\nThe **United States of Codecademy (USC)** is a fictional federal republic located in the digital continent of **Learnia**. Known globally as the "Nation of Coders," the USC is a technologically advanced country whose economy, culture, and politics revolve around programming, digital education, and innovation in artificial intelligence. It is widely regarded as the world’s hub for software engineering, data science, and online learning.\n\n## Etymology', '## Etymology\n\nThe name **Codecademy** is derived from the words *code* and *academy*, reflecting the country’s foundational principle of merging practical software development with structured learning. The phrase “United States” represents the federation of multiple provinces, each one specializing in a different branch of computer science and contributing to the digital unity of the nation.\n\n## History', 'The history of the United States of Codecademy begins with the **Great Syntax Revolution** of the early 21st century. During this period, programmers from across Learnia grew dissatisfied with proprietary restrictions and closed knowledge systems. They envisioned a nation where education, collaboration, and code could flourish without barriers'],
'uris': None,
'included': ['metadatas', 'documents', 'embeddings'],
'data': None,
'metadatas': [None, None, None]}
Now that we have the text data ingested in ChromaDB, we can build a retriever that fetches relevant text chunks from the database for any given query.
Step 3: Build a retriever to fetch relevant data for a given query
Retrieval is the first step in the real-time operation of a RAG application. To build a retriever, we will first initialize a LangChain Chroma vector store using the Chroma class defined in the langchain_chroma module. Chroma() takes a ChromaDB client, the collection name, and an embedding function as inputs to its client, collection_name, and embedding_function parameters.
We will initialize a ChromaDB client using the chromadb.PersistentClient() function, initialize an embedding function using the GoogleGenerativeAIEmbeddings() function defined in the langchain_google_genai module, and initialize the Chroma vector store as follows:
import chromadb
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Create a ChromaDB client
chroma_client = chromadb.PersistentClient(path="/home/aditya1117/codes/codecademy_resources/chromadb")

# Initialize an embedding function
embedding_fn = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")

# Use the following code to initialize the embedding function if you are using OpenAI models
# from langchain_openai import OpenAIEmbeddings
# embedding_fn = OpenAIEmbeddings(model="text-embedding-3-small")

# Initialize a LangChain vector store
vectorstore = Chroma(
    client=chroma_client,
    collection_name="united_states_of_codecademy",
    embedding_function=embedding_fn
)
When initializing the vector store, it is important to use the same embedding model to initialize the embedding function that we used while ingesting data into the ChromaDB vector database.
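To see why this matters, note that vector similarity is only defined between embeddings produced by the same model in the same vector space. The following standalone sketch (toy three-dimensional vectors, not real embeddings) shows cosine similarity and the dimension mismatch you hit when a query is embedded with a different model; in practice, even same-dimension vectors from different models are incomparable.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity is only meaningful when both vectors come from the
    # same embedding model: dimensions (and axis meanings) must match.
    if len(a) != len(b):
        raise ValueError("Embedding dimensions differ; the query and stored "
                         "vectors were likely created by different models.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

stored_vec = [0.1, 0.3, 0.5]   # pretend: ingested with embedding model A
query_vec = [0.2, 0.1, 0.4]    # query embedded with the same model
sim = cosine_similarity(stored_vec, query_vec)
print(round(sim, 3))

# A query embedded with a different model (different dimension) cannot be
# compared against the stored vectors at all:
try:
    cosine_similarity(stored_vec, [0.2, 0.1])
except ValueError as err:
    print("Error:", err)
```

Using a mismatched embedding model at query time typically fails loudly (dimension errors like the one above) or, worse, silently returns irrelevant documents.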
After initializing the vector store, we can build a retriever using the as_retriever() method of the vectorstore object. The as_retriever() method takes a dictionary of input arguments through its search_kwargs parameter. We will set k=3, which decides how many relevant documents to fetch for a given query, and create a retriever.
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
After execution, the as_retriever() method returns a retriever we can use in our RAG application to fetch documents for a given query.
Step 4: Augment the retrieved documents with the query
We will augment the retrieved documents with the query using a prompt template, a document combination chain, and a retrieval chain.
- First, we will define a prompt template that uses the fetched documents as context and instructs the LLM to respond to the query as per the context.
- Next, we will create a document combination chain using the create_stuff_documents_chain() function. The create_stuff_documents_chain() function takes a prompt template and an LLM object as its input. During execution, it concatenates all the fetched documents and the query into a prompt template and passes it to the LLM.
- Finally, we will pass the retriever and the document combination chain to the create_retrieval_chain() function to get the RAG application.
from langchain.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_google_genai import ChatGoogleGenerativeAI

# Initialize an LLM object
llm = ChatGoogleGenerativeAI(model="models/gemini-2.0-flash-lite")

# Use the following code if you are using OpenAI models
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-4o-mini")

# Define a prompt template
prompt_template = ChatPromptTemplate.from_template("""Answer the question based only on the following context:
{context}
Question: {input}""")

# Create a document combination chain using the LLM and the prompt template
document_chain = create_stuff_documents_chain(llm, prompt_template)

# Create a RAG application using the retriever and the document chain
rag_app = create_retrieval_chain(retriever, document_chain)
The create_retrieval_chain() function takes the retriever and the document combination chain as its input. It returns an end-to-end RAG application that we can use to answer queries based on underlying information in the database.
Step 5: Generate a response from the RAG application for the query
To generate a response for a given query, you can pass the query to the RAG application using the invoke() method, as follows:
query = "What is the population of the United States of Codecademy?"
response = rag_app.invoke({"input": query})
query_answer = response.get("answer")

print("The query is:")
print(query)
print("The RAG app's response is:")
print(query_answer)
Output:
The query is:
What is the population of the United States of Codecademy?
The RAG app's response is:
The population of the United States of Codecademy is estimated at around 150 million registered learners.
Every time we pass a query to the RAG application using the invoke() method, it fetches relevant documents for the given query and generates a response based on those documents. You can also look at the documents that the RAG application used to generate the response. To do this, you can read the context key of the response:
documents = response.get("context")

print("The documents used for response generation are:")
print(documents)
Output:
The documents used for response generation are:
[Document(id='chunk18', metadata={}, page_content='As of 2025, the population of the United States of Codecademy is estimated at around 150 million registered learners. The official languages of the nation are **Python, JavaScript, and SQL**, though Java, C++, and R are also widely spoken as secondary languages. The society is deeply shaped by its philosophical and spiritual movements, the most prominent being **Open Sourceism**, which promotes the belief that knowledge should be freely shared for the advancement of all'), Document(id='chunk0', metadata={}, page_content='# United States of Codecademy\n\nThe **United States of Codecademy (USC)** is a fictional federal republic located in the digital continent of **Learnia**. Known globally as the "Nation of Coders," the USC is a technologically advanced country whose economy, culture, and politics revolve around programming, digital education, and innovation in artificial intelligence. It is widely regarded as the world’s hub for software engineering, data science, and online learning.\n\n## Etymology'), Document(id='chunk12', metadata={}, page_content='The United States of Codecademy operates as a **federal digital democracy**. At its helm is the **Chief Coder (CC)**, who serves as head of state and government and is elected every four years through a nationwide online election. The legislative branch, known as the **Parliament of Programmers (PoP)**, is composed of elected representatives from each of the twelve provinces, tasked with drafting laws, approving new learning modules, and allocating national resources')]
Owing to the external data access, RAG applications provide many advantages compared to a standalone LLM. Let’s discuss some of these advantages.
Advantages of retrieval-augmented generation apps
RAG provides us with many advantages while building AI applications:
- Improved factual accuracy: A RAG application first pulls information from a knowledge base for any given query and then generates the response. Thus, RAG reduces hallucinations and improves the quality of the LLM output.
- Reduced costs: With RAG, we don’t need to deploy massive LLMs with billions of parameters. Since knowledge is not entirely stored in the model’s parameters, we can use smaller models with an external retrieval mechanism to generate accurate responses while keeping the computational costs and memory overhead in check.
- Adaptability: RAG helps us adapt an LLM to any domain without retraining or fine-tuning. We can plug a task-specific or domain-specific knowledge base into the RAG application to generate accurate responses. We can also use the same LLM for different tasks, which helps us reduce the infrastructure and compute costs while deploying LLM applications.
- Explainability and transparency: RAG applications allow us to see the text or paragraph that they use to generate responses for a given query. Using the document, we can trace the source of the information. This transparency increases trust in the RAG application, which is crucial for building AI applications for medical, legal, and finance domains.
- Dynamic updates to AI applications: We can update the knowledge base of a RAG application independently and adapt the AI application to new information without any downtime or LLM fine-tuning. This helps us adopt LLM applications in tools like CRM applications, business catalogs, and billing software that require up-to-date information to work correctly.
Along with their advantages, RAG applications also pose some technical and operational challenges. Let’s discuss some of their disadvantages.
Challenges of retrieval-augmented generation apps
You might face the following challenges while building and deploying RAG applications.
- Latency: For every query, a RAG application needs to retrieve information from a vector database, which increases the response time. The latency introduced due to retrieval can become a bottleneck for real-time applications.
- Scalability: As the knowledge base grows, the retrieval step becomes computationally expensive and harder to optimize. To speed up retrieval, we need techniques like indexing and approximate nearest-neighbor (ANN) search, implemented in libraries like Facebook AI Similarity Search (FAISS).
- Knowledge base management: The output quality of a RAG application depends on the quality of the information provided. Hence, we need to maintain and update high-quality information in the knowledge base. Poorly structured or outdated information can result in incorrect LLM outputs, leading to business or reputation loss.
- Retrieval quality dependency: If the retriever in the RAG application fails to fetch relevant information for a given query, the LLM output will be of low quality. Hence, we need to design retrievers that can retrieve the most relevant information from the knowledge base for any given query.
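The scalability concern above can be made concrete with a standalone sketch. This is a toy exact (brute-force) nearest-neighbor search over random vectors, not real embeddings: every query scores every stored vector, so retrieval cost grows linearly with the size of the knowledge base, which is exactly the work that ANN indexes such as those in FAISS are designed to avoid.

```python
import math
import random

def brute_force_top_k(query_vec, vectors, k=3):
    # Exact nearest-neighbor search: score every stored vector against the
    # query. Cost is O(n) per query, which becomes a bottleneck as the
    # knowledge base grows; ANN indexes trade a little accuracy for speed.
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)
    scores = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(vectors)]
    return sorted(scores, reverse=True)[:k]

random.seed(0)
dim = 8
corpus = [[random.random() for _ in range(dim)] for _ in range(1000)]
query = [random.random() for _ in range(dim)]

top = brute_force_top_k(query, corpus, k=3)
print(top)  # three (similarity, index) pairs, best first
```

Doubling the corpus doubles the work per query in this scheme, which is why production vector databases rely on indexes rather than exhaustive scans.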
Conclusion
Retrieval-augmented generation (RAG) offers a practical way to build smarter AI applications by combining knowledge retrieval with LLMs. While it has its challenges, its accuracy and efficiency advantages make it a powerful approach for real-world use cases.
In this article, we discussed building a RAG application using ChromaDB, LangChain, and Gemini. To learn more about building AI applications using RAG, you can go through this course on Creating AI Applications using Retrieval-Augmented Generation (RAG) that discusses how to build a RAG app with Streamlit and ChromaDB. You might also like this finetuning transformer models course that discusses LLM finetuning with LoRA, QLoRA, and Hugging Face.
Frequently asked questions
1. What is the difference between RAG and MCP?
RAG helps us build AI applications by providing external data access to LLMs to generate accurate and relevant outputs. On the other hand, Model Context Protocol (MCP) is an open standard designed to enable LLMs to integrate with external APIs, tools, and databases, and facilitates a standardized interface and seamless interaction between the LLM and external components.
2. What are the two main components of RAG?
The retriever and the LLM (generator) are the two main components of a RAG application. The retriever finds the information relevant to a query in an external database, and the generator uses that information to generate accurate outputs for the query.
3. What database to use for RAG?
You can build a RAG application with any vector database like ChromaDB, Pinecone, or Weaviate. Nowadays, general-purpose relational, document, key-value, and graph databases such as MySQL, MongoDB, Redis, Apache Cassandra, Elasticsearch, and Neo4j have also introduced vector search functionality, so you can use these databases in your RAG applications too.
4. What is the difference between RAG and traditional ML?
Traditional machine learning (ML) models rely solely on the patterns learned from the training dataset. On the other hand, RAG applications have external databases that they can use to generate outputs. Also, traditional ML models require a costly and time-consuming retraining process to adapt to a new dataset. On the contrary, we can adapt RAG applications to new domains or tasks by updating the external knowledge base without touching the LLM in the RAG application.
5. What is the difference between RAG and SLM?
Retrieval-augmented generation (RAG) is a technique to improve an LLM's response by providing external information relevant to a given query. A Small Language Model (SLM) is a transformer model that is smaller, more efficient, and more specialized for specific tasks than an LLM.