Articles

What is a Vector Database in AI? How They Work

This article covers the basics of vector databases, how they work, and their applications, with a hands-on example using ChromaDB.

What you’ll learn:

  1. Basics of vector databases and how they work.
  2. How to install and use the Chroma vector database to store and query text data.
  3. Applications of vector databases.
  4. Details about some popular vector databases and text embedding models.

What is a vector database?

A vector database is a database that stores data using vector embeddings. The vector embeddings are high-dimensional numerical arrays created from text, image, audio, or video data using deep learning algorithms and contain the semantic meaning of the input data. A vector database is optimized for storing, indexing, and searching high-dimensional vector embeddings.

We use vector databases to build artificial intelligence (AI) applications with retrieval-augmented generation (RAG), semantic search applications like search engines and document retrievers, and recommendation systems. Examples of vector databases include Chroma (ChromaDB), Pinecone, Weaviate, and FAISS.

To understand vector databases better, let’s discuss how they store, index, and query data.

How do vector databases work?

Vector databases store data as vector embeddings. When we query the database using a given text, the database first embeds the query into a vector. Then, it searches for similar vectors using metrics like cosine similarity or Euclidean distance and provides the most similar vectors and the associated data as its output. We can understand how this entire process works using four operations, i.e., data ingestion, storage, indexing, and querying.

1. Data ingestion

The data ingested into a vector database is stored as embeddings. Each embedding in a vector database has a fixed length, and each element of the embedding vector represents a particular property or feature. When we ingest data into the vector database, it is converted into a vector embedding that captures its semantic meaning.

For example, assume that a vector database stores data using embedding vectors of length five. Then, it may store the strings “Book”, “Website”, “Blog”, and “Potato” as follows:

Book -> [1, 0.8, 0.3, 0.9, 0.6]
Website -> [0.0, 0.7, 0.5, 0.6, 0.7]
Blog -> [0.0, 0.5, 0.4, 0.2, 0.7]
Potato -> [1, 0.0, 0.01, 0.05, 0.5]

Here, each element of a vector embedding has a semantic meaning. The first element might represent physicality, the second element might represent information density, and the third, fourth, and fifth elements might represent interactivity, length, and accessibility.

  • A “Book” is physical, rich in information, low in interactivity, long, and moderately accessible. Hence, it is embedded using the vector [1, 0.8, 0.3, 0.9, 0.6].
  • A “Website” is virtual, less information-dense, more interactive, moderately long, and highly accessible. Hence, it is stored as the embedding vector [0.0, 0.7, 0.5, 0.6, 0.7].
  • A “Potato” is physical and easy to access, but lacks information and interactivity. So, it is embedded as [1, 0, 0.01, 0.05, 0.5].

“Website” and “Blog” have similar embeddings because they are semantically related, while “Potato” has a completely different embedding pattern, showing it’s unrelated to other entities.
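We can check this intuition numerically. The following sketch (plain Python, using the toy five-element vectors from above) computes the cosine similarity between the embeddings:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

website = [0.0, 0.7, 0.5, 0.6, 0.7]
blog = [0.0, 0.5, 0.4, 0.2, 0.7]
potato = [1, 0.0, 0.01, 0.05, 0.5]

print(round(cosine_similarity(website, blog), 2))    # ≈ 0.95: semantically close
print(round(cosine_similarity(website, potato), 2))  # ≈ 0.27: unrelated
```

A similarity near 1 means the vectors point in almost the same direction, which is exactly what we see for “Website” and “Blog”.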

Similar to these entities, we can embed sentences, paragraphs, images, and videos based on their semantic meaning. Companies like Google and OpenAI provide embedding models that convert text data to high-dimensional vectors that capture semantic meaning for web-scale datasets.

The following are some of the popular embedding models for converting text data to embedding vectors:

| Model | Company | Maximum input token length | Embedding vector lengths | Default embedding vector length |
| --- | --- | --- | --- | --- |
| gemini-embedding-001 | Google | 2048 | 768, 1536, or 3072 | 3072 |
| text-embedding-005 | Google | 2048 | 256, 512, or 768 | 768 |
| text-embedding-3-large | OpenAI | 8192 | 256, 1024, or 3072 | 3072 |
| text-embedding-3-small | OpenAI | 8192 | 512 or 1536 | 1536 |
| voyage-3-large | Voyage (used by Anthropic) | 32000 | 256, 512, 1024, or 2048 | 1024 |
| voyage-multimodal-3 | Voyage (used by Anthropic) | 32000 | 1024 | 1024 |
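Several of these models support multiple output lengths; the shorter lengths are typically produced by truncating the full vector to its first N dimensions and re-normalizing it to unit length. A minimal sketch of that idea (the four-element vector stands in for a real high-dimensional embedding):

```python
import math

def shorten_embedding(vec, dims):
    # Keep the first `dims` values, then re-normalize to unit length
    truncated = vec[:dims]
    norm = math.sqrt(sum(x * x for x in truncated))
    return [x / norm for x in truncated]

full = [0.5, 0.5, 0.5, 0.5]        # stand-in for a full-length embedding
short = shorten_embedding(full, 2)  # request 2 of the 4 dimensions
print(short)                        # a unit-length 2-dimensional vector
```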

After processing the input data, the embeddings are stored in a vector database along with the input data and other metadata.

2. Storage

The data is stored in the vector database with attributes such as id, embedding, document, and metadata. For example, the words “Book” and “Website” can be stored in a vector database as follows:

{
"id": "1234",
"embedding": [1, 0.8, 0.3, 0.9, 0.6],
"document": "Book",
"metadata": {
"description": "A collection of pages.",
"created": '13-08-2025'
}
}
{
"id": "123412",
"embedding": [0.0, 0.7, 0.5, 0.6, 0.7],
"document": "Website",
"metadata": {
"description": "A collection of webpages.",
"created": '13-08-2025'
}
}

When we query a database, it fetches the documents with the most similar vector embeddings to the embedding vector of the input query. Searching through all the embeddings in a dataset to find similar vectors is expensive. Hence, vector databases use different indexing algorithms to store data in a manner that makes searches more efficient.

3. Indexing

Indexing in vector databases makes the vector search fast and efficient. The following are some of the popular indexing algorithms:

  • Hierarchical Navigable Small World (HNSW): In the HNSW indexing algorithm, each vector in the database is inserted into a graph where the edges connect similar vectors. When we search for a matching vector using a query, the database moves from an entry point towards graph nodes containing the most similar vectors. After finding the best match, the database returns the most similar vectors and their associated data as the query result. HNSW is used in vector databases like ChromaDB, Milvus, and Weaviate.
  • Inverted File Indexing (IVF): In IVF, the embedding vectors are grouped into centroids using clustering algorithms. When we query the database, the database searches for the matching results in the clusters with the nearest centroids to the query. IVF is used in vector databases like Milvus, Zilliz, and FAISS.
  • Product Quantization (PQ): PQ compresses high-dimensional vectors by splitting them into smaller subvectors and quantizes each subvector separately using a small set of representative centroids. Instead of storing the original embedding vectors, PQ stores only the centroid IDs for each subvector, reducing storage and memory requirements. When we query the database, it splits and quantizes the embedding vector of the query and computes distances using precomputed lookup tables without reconstructing full vectors.
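The cluster-then-search idea behind IVF can be illustrated with a toy sketch. The centroids and two-dimensional vectors below are made up for illustration; a real implementation would learn the centroids with k-means and use many clusters:

```python
import math

def distance(a, b):
    # Euclidean distance between two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hand-picked "centroids" standing in for k-means cluster centers
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = [[0.1, 0.2], [0.3, 0.1], [9.8, 10.1], [10.2, 9.9]]

# Index step: assign each vector to the bucket of its nearest centroid
buckets = {0: [], 1: []}
for v in vectors:
    nearest = min(range(len(centroids)), key=lambda i: distance(v, centroids[i]))
    buckets[nearest].append(v)

# Query step: search only the bucket whose centroid is closest to the query
query = [9.5, 10.0]
best_bucket = min(range(len(centroids)), key=lambda i: distance(query, centroids[i]))
result = min(buckets[best_bucket], key=lambda v: distance(query, v))
print(result)  # the nearest vector, found without scanning the other bucket
```

Because only one bucket is scanned, the search cost drops roughly in proportion to the number of clusters, at the price of occasionally missing a match that fell into another bucket.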

4. Querying

Whenever we query a database using any input text, the database first converts the input into a vector embedding. Then, it searches for the most similar vectors in the database using an approximate nearest neighbor search and returns the documents with the most similar embeddings and their metadata.
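The exact (non-approximate) version of this search is a linear scan: score every stored vector against the query embedding and return the top n. A brute-force sketch using the toy vectors from earlier (real databases replace the scan with ANN indexes such as HNSW):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy "collection": documents mapped to their (pretend) embeddings
collection = {
    "Book": [1, 0.8, 0.3, 0.9, 0.6],
    "Website": [0.0, 0.7, 0.5, 0.6, 0.7],
    "Blog": [0.0, 0.5, 0.4, 0.2, 0.7],
}

def query(query_embedding, n_results=2):
    # Score every document, sort by similarity, return the top n documents
    ranked = sorted(collection.items(),
                    key=lambda item: cosine_similarity(query_embedding, item[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:n_results]]

print(query([0.0, 0.6, 0.45, 0.4, 0.7]))  # a query vector close to "Website"/"Blog"
```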

Now that we have a theoretical understanding of how vector databases work, let’s see it in action using ChromaDB.

Using the Chroma vector database to store and search text data

We will install and use the ChromaDB vector database in Python to store and query data. Let’s discuss how to install ChromaDB and perform CRUD operations.

Install ChromaDB using pip

We can install ChromaDB using pip by executing the following command in a command-line terminal:

pip install chromadb

After installing ChromaDB, we need to initialize the vector database and create a client in our Python program to use it.

To initialize an in-memory database, you can use the Client() function defined in the chromadb module.

import chromadb
chroma_client = chromadb.Client()

In-memory ChromaDB instances are destroyed once the application terminates. To store and reuse the data, we can create a persistent ChromaDB client, which stores the data at a specified location in the file system. We will use the PersistentClient() function to do this; it takes the file path of the storage location and initializes a database backed by SQLite.

You can initialize a persistent ChromaDB client with storage as follows:

import chromadb
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")

Here, /home/aditya1117/codes/codecademy_resources/chromadb is the directory where ChromaDB persists the database, including its SQLite file and index data.

We store the data in ChromaDB collections, where a collection may contain data from one source or task. Let’s create a ChromaDB collection to store the data.

Create a ChromaDB collection

We use the create_collection() method to create a ChromaDB collection. The create_collection() method takes the collection’s name and other metadata as inputs to the name and metadata parameters. After execution, it creates a collection with the given name in the database.

import chromadb
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.create_collection(
name = "codecademy_v1",
metadata={
"description": "Collection for storing data for Codecademy."
})

By default, ChromaDB uses the sentence transformers model all-MiniLM-L6-v2 to create embeddings. However, you can change the embedding model used by ChromaDB.

For example, you can specify Gemini as the embedding model for a collection using the GoogleGenerativeAiEmbeddingFunction() function, as shown in the following example:

import chromadb
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction
embedding_fn = GoogleGenerativeAiEmbeddingFunction(api_key="Your_Gemini_API_Key")
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.create_collection(
name = "codecademy_v2",
embedding_function = embedding_fn,
metadata = {
"description": "Collection for storing data for Codecademy using Gemini embeddings."
})

In this example, we created a new embedding function using the GoogleGenerativeAiEmbeddingFunction() function and passed it to the embedding_function parameter in the create_collection() method. As a result, the documents are stored using the Gemini embeddings instead of the default embeddings.

To use Gemini embeddings, you must have the Gemini API key. Similarly, you need an OpenAI API key to use OpenAI embeddings.

You can use OpenAI models for embedding data in a ChromaDB collection as follows:

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
embedding_fn = OpenAIEmbeddingFunction(api_key="Your_OpenAI_API_Key")
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.create_collection(
name = "codecademy_v3",
embedding_function = embedding_fn,
metadata = {
"description": "Collection for storing data for Codecademy using OpenAI embeddings."}
)

After creating a ChromaDB collection, let’s discuss how to access an existing ChromaDB collection.

Access a ChromaDB collection

We use the get_collection() method to access an existing ChromaDB collection. The get_collection() method takes the name of the ChromaDB collection as input to its name parameter and returns the Collection object.

import chromadb
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v1")

The get_collection() method runs into an error when the collection is not present in ChromaDB. For example, our database doesn’t have the codecademy_v4 collection. When we try to access this collection, the program runs into an error showing that the collection doesn’t exist.

import chromadb
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v4")

Output:

NotFoundError: Collection [codecademy_v4] does not exists

Instead of the get_collection() method, we can use the get_or_create_collection() method to access a ChromaDB collection.

The get_or_create_collection() method creates a new collection if the collection is not present in the vector database, as shown in the following code:

import chromadb
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_or_create_collection(name = "codecademy_v4")

This code creates a new collection named codecademy_v4 instead of throwing an exception.

If we have used an embedding function other than the default embedding function while creating a ChromaDB collection, we must also pass the embedding function to the get_collection() method or the get_or_create_collection() method.

import chromadb
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction
embedding_fn = GoogleGenerativeAiEmbeddingFunction(api_key="Your_Gemini_API_Key")
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v2", embedding_function=embedding_fn)

If we don’t pass the embedding function to the get_collection() or the get_or_create_collection() method, the collection object falls back to the default embedding model. The embedding vectors generated in the current session will then have a different length and a different semantic space than the stored vectors, leading to an error while querying data.

After getting the collection, we can perform different read or write operations. Let’s discuss how to add data to a ChromaDB collection.

Add data to a ChromaDB collection

We can add data to a ChromaDB collection using the add() method. Let’s discuss how to prepare and add data to a ChromaDB collection.

Prepare dataset

We will insert the following text into the collection:

Codecademy is an online learning platform that has evolved from teaching basic programming skills to offering in-depth training in artificial intelligence and modern AI application development. It introduces learners to AI concepts such as neural networks, natural language processing, and generative models, while emphasizing hands-on practice through interactive coding exercises. The platform guides users in building AI-powered applications with LangChain, enabling large language models to interact with real-world tools and data. It also covers Retrieval-Augmented Generation (RAG), showing how to design systems that draw information from custom datasets to enhance the accuracy and relevance of LLM responses. Learners gain experience with vector databases like ChromaDB and Pinecone, understanding how to create, store, and search vector embeddings for use in search engines, recommendation systems, and RAG workflows. Through its interactive coding environment, instant feedback, and project-based learning, Codecademy ensures learners can translate theory into real-world AI solutions.

As given in the table on the embedding models, each embedding model has an input token length limit. If we pass a text with more tokens than the maximum token limit, the embedding model truncates the input, leading to information loss. Hence, we need to ensure that the maximum length of the input text doesn’t exceed the maximum token length of the embedding model.
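One simple way to enforce such a limit is to split the text into chunks of at most N tokens before embedding it. The sketch below approximates tokens with whitespace-separated words for illustration; a real pipeline would count tokens with the embedding model’s own tokenizer:

```python
def chunk_text(text, max_tokens=50):
    # Split on whitespace and group the words into chunks of at most max_tokens
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_tokens):
        chunks.append(" ".join(words[i:i + max_tokens]))
    return chunks

sample = "one two three four five six seven"
print(chunk_text(sample, max_tokens=3))
# ['one two three', 'four five six', 'seven']
```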

To ensure that each piece of input text is shorter than the maximum token length, we will split the text into sentences before inserting it. To do this, we will first download the necessary tokenizer data (the punkt and punkt_tab packages) using the nltk library.

import nltk
nltk.download('punkt_tab')
nltk.download('punkt')

After downloading the tokenizer data, we will use the sent_tokenize() function from the nltk.tokenize module to split the input text into sentences as follows:

from nltk.tokenize import sent_tokenize
input_text = """Codecademy is an online learning platform that has evolved from teaching basic programming skills to offering in-depth training in artificial intelligence and modern AI application development. It introduces learners to AI concepts such as neural networks, natural language processing, and generative models, while emphasizing hands-on practice through interactive coding exercises. The platform guides users in building AI-powered applications with LangChain, enabling large language models to interact with real-world tools and data. It also covers Retrieval-Augmented Generation (RAG), showing how to design systems that draw information from custom datasets to enhance the accuracy and relevance of LLM responses. Learners gain experience with vector databases like ChromaDB and Pinecone, understanding how to create, store, and search vector embeddings for use in search engines, recommendation systems, and RAG workflows. Through its interactive coding environment, instant feedback, and project-based learning, Codecademy ensures learners can translate theory into real-world AI solutions."""
sentences = sent_tokenize(input_text)
print("The input sentences are:")
print(sentences)

Output:

The input sentences are:
['Codecademy is an online learning platform that has evolved from teaching basic programming skills to offering in-depth training in artificial intelligence and modern AI application development.', 'It introduces learners to AI concepts such as neural networks, natural language processing, and generative models, while emphasizing hands-on practice through interactive coding exercises.', 'The platform guides users in building AI-powered applications with LangChain, enabling large language models to interact with real-world tools and data.', 'It also covers Retrieval-Augmented Generation (RAG), showing how to design systems that draw information from custom datasets to enhance the accuracy and relevance of LLM responses.', 'Learners gain experience with vector databases like ChromaDB and Pinecone, understanding how to create, store, and search vector embeddings for use in search engines, recommendation systems, and RAG workflows.Through its interactive coding environment, instant feedback, and project-based learning, Codecademy ensures learners can translate theory into real-world AI solutions.']

After converting the input text data into sentences, we can insert it into a ChromaDB collection.

Insert data into the ChromaDB collection

When invoked on a ChromaDB collection object, the add() method takes a list of IDs and a list of document strings as input. We will first create a list of IDs for the strings in the input data.

num_inputs=len(sentences)
id_list=["text_"+str(i) for i in range(num_inputs)]
print("The IDs are:")
print(id_list)

Output:

The IDs are:
['text_0', 'text_1', 'text_2', 'text_3', 'text_4']

Next, we will pass the IDs and the list of input strings to the ids and documents parameters in the add() method as follows:

import chromadb
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v1")
collection.add(
ids=id_list,
documents = sentences
)

If you are using a specific embedding model, you must pass the embedding function to the get_collection() method while adding the data to the collection.

import chromadb
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction
embedding_fn = GoogleGenerativeAiEmbeddingFunction(api_key="Your_Gemini_API_Key")
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v2", embedding_function=embedding_fn)
collection.add(
ids=id_list,
documents = sentences
)

After inserting the data into the ChromaDB collection, let’s read it from the Chroma vector database.

Read data from ChromaDB

ChromaDB provides different methods for counting the number of documents, reading the first few documents in the database, and reading data from a collection based on a query. Let’s discuss how to read data from a given collection.

Read data from a ChromaDB collection

You can get the number of documents in a ChromaDB collection by invoking the count() method on the collection as follows:

import chromadb
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction
embedding_fn = GoogleGenerativeAiEmbeddingFunction(api_key="Your_Gemini_API_Key")
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v2", embedding_function=embedding_fn)
print("The number of documents in the collection are:")
print(collection.count())

Output:

The number of documents in the collection are:
5

Similarly, you can look at the top few documents in a ChromaDB collection using the peek() method. The peek() method, when invoked on a ChromaDB collection, takes the number of documents to read as its input and returns the documents. For example, we can get the first three documents of the codecademy_v2 collection using the peek() method as follows:

import chromadb
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction
embedding_fn = GoogleGenerativeAiEmbeddingFunction(api_key="Your_Gemini_API_Key")
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v2", embedding_function=embedding_fn)
documents=collection.peek(3)
print("The first three documents in the collection are:")
print(documents)

Output:

The first three documents in the collection are:
{'ids': ['text_0', 'text_1', 'text_2'], 'embeddings': array([[ 0.01761805, -0.05082946, -0.03045218, ..., 0.02625313,
-0.06637532, 0.03064525],
[ 0.03014025, -0.04932779, -0.02771148, ..., 0.02402657,
-0.04188502, 0.03635174],
[ 0.03077473, -0.02291265, -0.06256586, ..., 0.02889462,
-0.05969379, 0.03796252]]), 'documents': ['Codecademy is an online learning platform that has evolved from teaching basic programming skills to offering in-depth training in artificial intelligence and modern AI application development.', 'It introduces learners to AI concepts such as neural networks, natural language processing, and generative models, while emphasizing hands-on practice through interactive coding exercises.', 'The platform guides users in building AI-powered applications with LangChain, enabling large language models to interact with real-world tools and data.'], 'uris': None, 'included': ['metadatas', 'documents', 'embeddings'], 'data': None, 'metadatas': [None, None, None]}

You can also get the length of the embedding vectors using the embeddings attribute of the documents as follows:

import chromadb
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction
embedding_fn = GoogleGenerativeAiEmbeddingFunction(api_key="Your_Gemini_API_Key")
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v2", embedding_function=embedding_fn)
documents=collection.peek(3)
print("The embedding vector length is:")
print(len(documents['embeddings'][0]))

Output:

The embedding vector length is:
768

We used the Gemini embedding function while creating the codecademy_v2 collection. Hence, the length of the embedding vectors is 768.

Apart from these read operations, we can query a ChromaDB collection using an input text. Let’s discuss how to do so.

Query a ChromaDB collection

We use the query() method to query data from a ChromaDB collection. The query() method takes a list of input queries as input to its query_texts parameter and the number of matching documents to retrieve from the collection as an input to the n_results parameter. After execution, it returns n_results documents from the database most similar to the input queries.

For example, we can get the two most similar documents in the codecademy_v2 collection for the query “What is codecademy” using the query() method as follows:

import chromadb
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction
embedding_fn = GoogleGenerativeAiEmbeddingFunction(api_key="Your_Gemini_API_Key")
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v2", embedding_function=embedding_fn)
documents=collection.query(query_texts=["What is Codecademy"], n_results=2)
print("The query output is:")
print(documents)

Output:

The query output is:
{'ids': [['text_0', 'text_4']], 'embeddings': None, 'documents': [['Codecademy is an online learning platform that has evolved from teaching basic programming skills to offering in-depth training in artificial intelligence and modern AI application development.', 'Learners gain experience with vector databases like ChromaDB and Pinecone, understanding how to create, store, and search vector embeddings for use in search engines, recommendation systems, and RAG workflows. Through its interactive coding environment, instant feedback, and project-based learning, Codecademy ensures learners can translate theory into real-world AI solutions.']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[None, None]], 'distances': [[0.22402796149253845, 0.40515339374542236]]}

This output shows that the first sentence, “Codecademy is an online learning platform that has evolved from teaching basic programming skills to offering in-depth training in artificial intelligence and modern AI application development.”, is the most relevant match for the query. Hence, it appears first in the results of the query() method.

The documents in a ChromaDB collection can have different attributes. While executing the query() method, we can select what attributes to retrieve in the output by passing the desired attribute names to the include parameter. For example, if we want to get only the vector embeddings and the text data in the output of the query() method, we can pass the ["embeddings", "documents"] list to the include parameter:

import chromadb
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction
embedding_fn = GoogleGenerativeAiEmbeddingFunction(api_key="Your_Gemini_API_Key")
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v2", embedding_function=embedding_fn)
documents=collection.query(query_texts=["What is Codecademy"], n_results=2, include=["embeddings", "documents"])
print("The query output is:")
print(documents)

Output:

The query output is:
{'ids': [['text_0', 'text_4']], 'embeddings': [array([[ 0.01761805, -0.05082946, -0.03045218, ..., 0.02625313,
-0.06637532, 0.03064525],
[ 0.02111817, -0.05757593, -0.05742461, ..., 0.01251548,
-0.06274018, -0.00149729]])], 'documents': [['Codecademy is an online learning platform that has evolved from teaching basic programming skills to offering in-depth training in artificial intelligence and modern AI application development.', 'Learners gain experience with vector databases like ChromaDB and Pinecone, understanding how to create, store, and search vector embeddings for use in search engines, recommendation systems, and RAG workflows.Through its interactive coding environment, instant feedback, and project-based learning, Codecademy ensures learners can translate theory into real-world AI solutions.']], 'uris': None, 'included': ['embeddings', 'documents'], 'data': None, 'metadatas': None, 'distances': None}

Apart from storing and retrieving data, we can also update and delete data in the ChromaDB vector database. Let’s discuss how to update data in ChromaDB.

Update data in ChromaDB

ChromaDB provides different methods for updating collections and their documents. Let’s first discuss how to update a collection in ChromaDB.

Update collections in ChromaDB

We can change a ChromaDB collection’s name and metadata using the modify() method. The modify() method takes the new name and metadata as input to its name and metadata parameters and updates the collection.

For example, we can change the name and metadata of the codecademy_v1 collection as follows:

import chromadb
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v1")
collection.modify(name="codecademy", metadata={"description": "Updated description for collection for storing data for Codecademy."})

After modifying the details, we can access the collection using the new name as follows:

import chromadb
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
updated_collection = chroma_client.get_collection(name = "codecademy")
print("The number of documents in the collection is:")
print(updated_collection.count())

Output:

The number of documents in the collection is:
5

As we have renamed the codecademy_v1 collection to codecademy, reading the codecademy_v1 collection leads to an error:

import chromadb
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v1")
print("Number of documents in the collection are:")
print(collection.count())

Output:

NotFoundError: Collection [codecademy_v1] does not exists

Apart from updating the details of a ChromaDB collection, we can also update the documents in a collection. Let’s discuss how to do so.

Update documents in a ChromaDB collection

We use the update() method to update documents in a collection. It takes a list of IDs for which we want to update the documents and a list of documents as its input. After execution, it updates the documents for the given IDs.

import chromadb
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction
embedding_fn = GoogleGenerativeAiEmbeddingFunction(api_key="Your_Gemini_API_Key")
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v2", embedding_function=embedding_fn)
collection.update(ids=["text_4"],documents=["This is a sample text to show the update operation."])

In this code, we have updated the text document for document ID text_4 to This is a sample text to show the update operation.

Now, let’s try to get the document with ID text_4, the fifth document in the collection.

import chromadb
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction
embedding_fn = GoogleGenerativeAiEmbeddingFunction(api_key="Your_Gemini_API_Key")
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v2", embedding_function=embedding_fn)
documents=collection.peek(5)
document=documents["documents"][4]
print("The fifth document in the collection is:")
print(document)

Output:

The fifth document in the collection is:
This is a sample text to show the update operation.

As you can see, the document has been successfully updated for ID text_4. If we pass a non-existing ID to the update() method, it runs into an error.

We can use the upsert() method to insert the new IDs along with the documents while updating them. The upsert() method updates the documents for existing IDs and inserts the documents with new IDs.

import chromadb
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction
embedding_fn = GoogleGenerativeAiEmbeddingFunction(api_key="Your_Gemini_API_Key")
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v2", embedding_function=embedding_fn)
collection.upsert(ids=["text_4","text_5"],
documents=["This is a sample text to show the upsert operation.","This is a sample text for upsert operation."])

In this example, we passed documents for the IDs text_4 and text_5 to the upsert() method. As the ID text_4 already exists in the collection, ChromaDB updates its document. The ID text_5 doesn't exist in the collection, so the document with ID text_5 is inserted. You can verify this by querying the collection as follows:

import chromadb
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction
embedding_fn = GoogleGenerativeAiEmbeddingFunction(api_key="Your_Gemini_API_Key")
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v2", embedding_function=embedding_fn)
documents=collection.query(query_texts=["Sample text"], n_results=2)
print("The query output is:")
print(documents)

Output:

The query output is:
{'ids': [['text_4', 'text_5']], 'embeddings': None, 'documents': [['This is a sample text to show the upsert operation.', 'This is a sample text for upsert operation.']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[None, None]], 'distances': [[0.20156435668468475, 0.2676506042480469]]}

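The update-or-insert behavior of upsert() can be illustrated with a plain Python dictionary. This is a toy stand-in for a collection, not ChromaDB's actual implementation:

```python
# Toy model of a collection: maps document IDs to document text.
collection = {
    "text_4": "This is a sample text to show the update operation.",
}

def upsert(store, ids, documents):
    """Update existing IDs, insert new ones -- mirroring upsert() semantics."""
    for doc_id, doc in zip(ids, documents):
        store[doc_id] = doc  # dict assignment updates or inserts

upsert(collection,
       ids=["text_4", "text_5"],
       documents=["This is a sample text to show the upsert operation.",
                  "This is a sample text for upsert operation."])

print(collection["text_4"])  # updated in place
print(collection["text_5"])  # newly inserted
```

A real collection also stores embeddings and metadata alongside each document, but the ID-based update-or-insert logic is the same.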
Besides updating documents, we can also delete data from a ChromaDB collection. Let's discuss how to do so.

Delete data from ChromaDB

To delete specific documents from a ChromaDB collection, you can pass the list of document IDs to the delete() method.

import chromadb
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction
embedding_fn = GoogleGenerativeAiEmbeddingFunction(api_key="Your_Gemini_API_Key")
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v2", embedding_function=embedding_fn)
print("Number of documents in the collection before deletion:")
print(collection.count())
collection.delete(ids=["text_5"])
print("Number of documents in the collection after deletion:")
print(collection.count())

Output:

Number of documents in the collection before deletion:
6
Number of documents in the collection after deletion:
5

You can also delete documents from a ChromaDB collection using their metadata. The where parameter of the delete() method lets us specify a condition, and all matching documents are deleted. For instance, we can delete all the documents whose source metadata is article using the following code:

import chromadb
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction
embedding_fn = GoogleGenerativeAiEmbeddingFunction(api_key="Your_Gemini_API_Key")
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
collection = chroma_client.get_collection(name = "codecademy_v2", embedding_function=embedding_fn)
print("Number of documents in the collection before deletion:")
print(collection.count())
collection.delete(where = {"source":"article"})
print("Number of documents in the collection after deletion:")
print(collection.count())

Output:

Number of documents in the collection before deletion:
5
Number of documents in the collection after deletion:
5

As none of our documents have metadata, the condition source == article isn't true for any document. Due to this, no documents are deleted from the database, and the document count remains unchanged. You can filter on any metadata attribute when deleting documents using the where parameter of the delete() method.

We can also delete an entire ChromaDB collection if necessary. For this, we use the delete_collection() method. When invoked on a ChromaDB client, the delete_collection() method takes the collection’s name as its input and deletes it.

import chromadb
chroma_client = chromadb.PersistentClient(path = "/home/aditya1117/codes/codecademy_resources/chromadb")
chroma_client.delete_collection("codecademy")

Deleting a collection or documents from a collection is a destructive operation, and you cannot undo this operation. Hence, be careful while deleting data from ChromaDB.

We have discussed the basics of vector databases and hands-on examples of inserting, reading, updating, and deleting data from vector databases using the ChromaDB vector database. Now, let’s discuss using vector databases to build AI applications.

Using vector databases to build AI applications

Vector databases help us store, index, and retrieve data using its semantic meaning instead of exact keywords. This lets us retrieve text and examples from the database that are semantically similar to a query. A typical workflow for using a vector database in an AI application looks like this:

  1. First, we generate embeddings for our text data and store them in a vector database like ChromaDB.
  2. Whenever a user sends the application a query, we generate the query's embedding and fetch semantically similar documents from the database.
  3. Finally, we pass the retrieved documents to the model as context so it can generate an accurate response to the user's query.

Using the above steps, we can build chatbots and other AI applications using proprietary or domain-specific datasets to generate informed and accurate responses. Now, let’s look at some of the popular databases that we can use in our AI applications.
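The retrieval step in this workflow boils down to embedding the query and ranking stored embeddings by similarity. A minimal sketch using hand-made toy vectors and cosine similarity (a real application would use a learned embedding model, as in the ChromaDB examples above):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Step 1: store documents with their (toy) embeddings.
store = {
    "doc_1": ([0.9, 0.1, 0.0], "Vector databases store embeddings."),
    "doc_2": ([0.0, 0.2, 0.9], "Paris is the capital of France."),
}

# Step 2: embed the query (here, a hand-made vector) and rank by similarity.
query_embedding = [0.8, 0.2, 0.1]
best_id = max(store,
              key=lambda doc_id: cosine_similarity(query_embedding,
                                                   store[doc_id][0]))

# Step 3: the retrieved document would be passed to an LLM as context.
print(store[best_id][1])  # -> Vector databases store embeddings.
```

The query vector points in nearly the same direction as doc_1's embedding, so doc_1 is retrieved; this is the same ranking ChromaDB performs internally when we call query().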

The following are some of the popular vector databases that we can use in our AI applications:

  1. Pinecone: It is a fully managed cloud-native vector database built for real-time similarity search. Pinecone is popular for scalability and low-latency query processing.
  2. Milvus: Milvus is an open-source and highly scalable vector database optimized for enterprise-scale data. It supports multiple indexing algorithms and is popular in enterprise AI applications.
  3. Weaviate: It is an open-source vector database with a modular architecture. Weaviate supports keyword search along with vector search.
  4. Qdrant: It is an open-source vector database focused on performance and developer-friendly APIs. Qdrant provides payload filtering and efficient storage for semantic search.
  5. ChromaDB: ChromaDB is an open-source embedding database designed for large language model (LLM) applications. We often use it to build RAG pipelines with LangChain.
  6. Redis: Redis is an in-memory key-value database popular for low-latency read/write operations. It now also supports vector similarity search via extensions.

With the increasing popularity of AI applications, many established databases and search engines, such as Apache Cassandra, Elasticsearch, MySQL, MongoDB, and Neo4j, have also introduced vector search functionality. Thus, we can also use these systems as vector databases.

Conclusion

Vector databases make it possible to store and search data by meaning, which helps us build powerful AI applications with semantic search and RAG. In this article, we discussed the basics of vector databases and how they work. We also discussed working with the ChromaDB vector database to store, query, update, and delete documents.

To learn more about how vector databases work, you can go through this course on Creating AI Applications using Retrieval-Augmented Generation (RAG), which discusses how to build a RAG app with Streamlit and ChromaDB. You might also like this fine-tuning transformer models course, which discusses LLM fine-tuning with LoRA, QLoRA, and Hugging Face.

Frequently asked questions

1. Is MongoDB a vector DB?

MongoDB is not a vector database. It is a document database best suited for structured JSON data. However, using the Atlas vector search feature, we can use MongoDB as a vector database. Atlas vector search allows us to store and use vector embeddings alongside traditional data within the same MongoDB database.

2. Is Neo4j a vector DB?

Neo4j is not a vector database. It is a graph database used to store and query data as nodes, relationships, and properties. However, Neo4j supports vector similarity search, which helps it function as a vector database within its graph structure. We can store and query vector embeddings alongside the graph data.

3. Why is vector DB used?

Vector databases are used to store and query high-dimensional vector data. Using vector databases, we can convert text, images, audio, and video data into vector embeddings that capture the semantic meaning of the data. Then, we can query the stored vector embeddings in AI applications.

4. What is the difference between SQL DB and vector DB?

SQL databases store structured tabular data and query data by matching exact values. On the other hand, vector databases store data as vector embeddings and query data by similarity search using distance metrics like cosine similarity or Euclidean distance.
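The two distance metrics mentioned above can be computed in a few lines. A small sketch with toy vectors (not real embeddings, which typically have hundreds or thousands of dimensions):

```python
import math

a = [1.0, 2.0, 3.0]
b = [2.0, 3.0, 4.0]

# Euclidean distance: sqrt of the sum of squared differences; smaller = closer.
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Cosine similarity: dot product over the product of magnitudes;
# 1.0 means the vectors point in exactly the same direction.
dot = sum(x * y for x, y in zip(a, b))
cosine = dot / (math.sqrt(sum(x * x for x in a)) *
                math.sqrt(sum(y * y for y in b)))

print(round(euclidean, 4))  # -> 1.7321
print(round(cosine, 4))     # -> 0.9926
```

An SQL WHERE clause would treat these two vectors as simply not equal; a vector database instead quantifies how close they are.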

5. What are the advantages of a vector database?

Vector databases help us store and query data using its semantic meaning instead of keyword matching. This makes them ideal for applications like semantic search, recommendation systems, and retrieval-augmented generation (RAG). Vector databases are also optimized for high-dimensional similarity search using advanced indexing techniques, which makes querying vector databases efficient for large-scale AI applications.

6. What are the disadvantages of a vector database?

Vector databases are resource-intensive, as storing and indexing high-dimensional vector embeddings requires significant memory and compute. Also, vector databases typically retrieve the most similar results using approximate nearest neighbor (ANN) search, which is fast but not always exact. Hence, there can be accuracy trade-offs when using vector databases.

Codecademy Team
