Disclaimer
I am assuming you are using the latest versions of the Python packages mentioned. At the time of writing, these are:

langchain version 0.3.14
langchain-chroma version 0.2.0

If you are on different versions, please state them explicitly so that I can give a more accurate answer.
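If you are not sure which versions you have installed, a small sketch like the one below prints them using the standard library's importlib.metadata. The helper name installed_version is made up for this example; the package names are the ones listed above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str:
    """Return the installed version of a package, or a note if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


# Print the versions of the two packages this answer relies on.
for name in ("langchain", "langchain-chroma"):
    print(f"{name}: {installed_version(name)}")
```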

To check whether a document exists in the vector store based on its metadata, the .get() method is your best option.

Here’s a summary of how it works:

Set the limit: The limit argument caps the number of results returned. For an existence check, 1 is enough.
Use a where filter: Pass a where dictionary to use the Metadata Filtering feature provided by Chroma. As described in this documentation:

"An optional where filter dictionary can be supplied to filter by the metadata associated with each document."

Details on configuring the where filter are available here.
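To give a rough idea of the filter syntax, the sketch below builds where dictionaries in plain Python. The helper name build_where and the source metadata key are made up for this example; $eq, $in and $and are operators from Chroma's metadata filtering. I am assuming here that Chroma expects a single top-level key in the filter, which is why several conditions are wrapped in $and; please check this against the Chroma version you use.

```python
from typing import Any


def build_where(**conditions: Any) -> dict[str, Any]:
    """Build a Chroma-style where filter from key=value pairs (hypothetical helper)."""
    if not conditions:
        raise ValueError("at least one condition is required")
    clauses = [{key: {"$eq": value}} for key, value in conditions.items()]
    # A single condition is returned as-is; several are combined with $and.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


# One field: the same filter used in the snippets in this answer.
print(build_where(id="ABC123"))

# Two fields ("source" is an assumed metadata key, not one from this answer).
print(build_where(id="ABC123", source="faq"))

# Match any of several values with $in.
any_of = {"id": {"$in": ["ABC123", "XYZ123"]}}
print(any_of)
```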

Once configured, you're all set. For example, the following snippet demonstrates the functionality:

existing_metadata = db.get(
    limit=1,
    where={"id": {"$eq": "ABC123"}}
)["metadatas"]

This code returns a list (limited to one element) containing the metadata of documents that match the where condition. If nothing matches, the list is empty.
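If you need a plain yes/no answer, the check can be wrapped in a small function. This is only a sketch: document_exists is a name I made up, and FakeStore is a tiny in-memory stand-in (not part of langchain or Chroma) so the example runs without a database. With a real store you would pass your Chroma instance instead.

```python
from typing import Any


def document_exists(store: Any, key: str, value: Any) -> bool:
    """Return True if at least one stored document has metadata[key] == value."""
    result = store.get(limit=1, where={key: {"$eq": value}})
    # No match gives an empty "metadatas" list, which is falsy.
    return bool(result["metadatas"])


class FakeStore:
    """In-memory stand-in for the vector store (hypothetical, demo only)."""

    def __init__(self, metadatas: list[dict[str, Any]]) -> None:
        self._metadatas = metadatas

    def get(self, limit: int, where: dict[str, Any]) -> dict[str, Any]:
        # Supports only the {"key": {"$eq": value}} form used above.
        key, condition = next(iter(where.items()))
        matches = [m for m in self._metadatas if m.get(key) == condition["$eq"]]
        return {"metadatas": matches[:limit]}


store = FakeStore([{"id": "ABC123"}])
print(document_exists(store, "id", "ABC123"))  # True
print(document_exists(store, "id", "XYZ123"))  # False
```

Because the function only relies on .get(limit=..., where=...) returning a dictionary with a "metadatas" key, it should work unchanged with the db object from this answer.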

Below is a complete code example to illustrate how this works:

import os
from langchain_chroma import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai.embeddings import AzureOpenAIEmbeddings
from dotenv import load_dotenv, find_dotenv

# Load environment variables
load_dotenv(find_dotenv(".env"), override=True)

# Prepare embeddings and the vector store
embeddings = AzureOpenAIEmbeddings(
    api_key=os.environ.get("AZURE_OPENAI_EMBEDDINGS_API_KEY"),
    api_version=os.environ.get("AZURE_OPENAI_EMBEDDINGS_VERSION"),
    azure_deployment=os.environ.get("AZURE_OPENAI_EMBEDDINGS_MODEL"),
    azure_endpoint=os.environ.get("AZURE_OPENAI_EMBEDDINGS_ENDPOINT")
)
db = Chroma(
    persist_directory=os.environ.get("CHROMA_PATH"),
    embedding_function=embeddings,
    collection_name="stackoverflow-help",
)

# Add documents to the vector store
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=int(os.environ["CHROMA_EMBEDDINGS_CHUNK_SIZE"]),
    chunk_overlap=int(os.environ["CHROMA_EMBEDDINGS_CHUNK_OVERLAP"])
)

documents = text_splitter.create_documents(["This is a test document for the Chroma database."])
for doc in documents:
    doc.metadata = {"id": "ABC123"}
db.add_documents(documents)

# Check if the document is in the vector store
existing_metadata = db.get(
    limit=1,
    where={"id": {"$eq": "ABC123"}}
)["metadatas"]
print(existing_metadata)

# Check for a document that is not in the vector store
non_existing_metadata = db.get(
    limit=1,
    where={"id": {"$eq": "XYZ123"}}
)["metadatas"]
print(non_existing_metadata)

When you run this code, the results will be as follows:

[{'id': 'ABC123'}]  # Output of print(existing_metadata)
[]  # Output of print(non_existing_metadata)
