[Bug]: ids and docs added to db, but embeddings only appear after restart of Jupyter kernel #3769

Open
jzclever opened this issue Feb 12, 2025 · 3 comments
@jzclever

What happened?

Running with Python 3.13.0 on an M3 Macbook

Macbook M3
MacOS 15.3.1
Python 3.13.0
VS Code 1.97.0
VS Code Jupyter Extension 2025.1.0
Chroma 0.6.3

Execute cell 1 of notebook _t1.ipynb running in VS Code:

import chromadb, os 
op = os.path 
curr_dir = op.abspath(os.getcwd())
chroma_path = op.join(curr_dir, 'chroma')
client = chromadb.PersistentClient(chroma_path)

From a separate terminal, execute the script _t2.py

import chromadb, os

op = os.path 
curr_dir = op.abspath(op.dirname(__file__))
chroma_path = op.join(curr_dir, 'chroma')
client = chromadb.PersistentClient(chroma_path)

t = client.get_or_create_collection('test')

for n in range(1,6):
    new_id = f'id_{n}'
    new_doc = f'doc_{n}'
    t.upsert(ids=[new_id], documents=[new_doc])

result = t.get(include=['documents','embeddings'])

print(f'{result["ids"]                = }')
print(f'{result["documents"]          = }')
print(f'{len(result["embeddings"])    = }')
print(f'{len(result["embeddings"][0]) = }')

Confirm the expected output in the console:

result["ids"]                = ['id_1', 'id_2', 'id_3', 'id_4', 'id_5']
result["documents"]          = ['doc_1', 'doc_2', 'doc_3', 'doc_4', 'doc_5']
len(result["embeddings"])    = 5
len(result["embeddings"][0]) = 384

Run cell 2 of _t1.ipynb:

t = client.get_or_create_collection('test')

result = t.get(include=['documents','embeddings'])

print(f'{result["ids"]                = }')
print(f'{result["documents"]          = }')
print(f'{len(result["embeddings"])    = }')
# print(f'{len(result["embeddings"][0]) = }')

result = t.query(
    query_texts=['doc_1'], 
    include=['documents','embeddings'],
    n_results=5
)

print(f"\n {'<>' * 10 } \n")
print(f'{result["ids"]                = }')
print(f'{result["documents"]          = }')
print(f'{len(result["embeddings"])    = }')
print(f'{len(result["embeddings"][0]) = }')
print(f'{result["embeddings"]         = }')

Confirm that the ids and documents are there, but embeddings are not

result["ids"]                = ['id_1', 'id_2', 'id_3', 'id_4', 'id_5']
result["documents"]          = ['doc_1', 'doc_2', 'doc_3', 'doc_4', 'doc_5']
len(result["embeddings"])    = 0

 <><><><><><><><><><> 

result["ids"]                = [[]]
result["documents"]          = [[]]
len(result["embeddings"])    = 1
len(result["embeddings"][0]) = 0
result["embeddings"]         = [array([], dtype=float64)]

Not only is this buggy, but it is also inconsistent.

On a separate trial, I managed to (somehow) get one embedding properly stored, even though t.get() was showing 5 total documents (so 4 were still missing).

The really baffling part came with t.query() warning that I only had 4 existing elements, despite the request for n_results=5.

result["ids"]                = ['id_1', 'id_2', 'id_3', 'id_4', 'id_5']
result["documents"]          = ['doc_1', 'doc_2', 'doc_3', 'doc_4', 'doc_5']
len(result["embeddings"])    = 1
len(result["embeddings"][0]) = 384

 <><><><><><><><><><> 

Number of requested results 5 is greater than number of elements in index 4, updating n_results = 4
result["ids"]                = [['id_1', 'id_2', 'id_3', 'id_4']]
result["documents"]          = [['doc_1', 'doc_2', 'doc_3', 'doc_4']]
len(result["embeddings"])    = 1
len(result["embeddings"][0]) = 1

This behavior occurs for image-based collections as well.

The behavior also occurs with collection.add (it is not specific to collection.upsert, which my example uses).

The behavior also occurs if I create the collection for the first time in the notebook cell rather than in the script (as in my current example).

If I restart the kernel and run the notebook cells again, I get the expected output:

result["ids"]                = ['id_1', 'id_2', 'id_3', 'id_4', 'id_5']
result["documents"]          = ['doc_1', 'doc_2', 'doc_3', 'doc_4', 'doc_5']
len(result["embeddings"])    = 5

 <><><><><><><><><><> 

result["ids"]                = [['id_1', 'id_2', 'id_3', 'id_4', 'id_5']]
result["documents"]          = [['doc_1', 'doc_2', 'doc_3', 'doc_4', 'doc_5']]
len(result["embeddings"])    = 1
len(result["embeddings"][0]) = 5

I have also confirmed that this issue does not exist in Google Colab.

I can run the first cell (to instantiate the persistent client), then write my _t2.py to the local Colab file system and call it with %run _t2.py, and then verify the existence of the newly inserted content (with embeddings) by running the second cell of the still-active notebook.

Versions

Macbook M3
MacOS 15.3.1

Python 3.13.0
VS Code 1.97.0
VS Code Jupyter Extension 2025.1.0

Chroma 0.6.3

Relevant log output

@jzclever jzclever added the bug Something isn't working label Feb 12, 2025
@tazarov tazarov self-assigned this Feb 12, 2025
@tazarov
Contributor

tazarov commented Feb 12, 2025

@jzclever, if I understand your problem correctly, you are trying to read/update data in Chroma from two different processes: a notebook and a Python script (e.g. python _t2.py). If that is the case, the problem you are hitting is that Chroma is not process-safe. You cannot expect to hit the same persistent dir from two different processes (a notebook and a separate Python script), which is also why running the whole thing from a single Google Colab works fine.

As an experiment, try running Chroma as a server and using HttpClient instead of PersistentClient in both the notebook and the script to see the difference.
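For anyone trying this experiment, here is a minimal sketch of the client side. It assumes a server was started locally with `chroma run --path ./chroma` and is listening on localhost:8000; the helper name `open_shared_collection` and the two constants are made up for illustration and are not part of Chroma.

```python
# Start the server first, in a separate terminal (assumed command and path):
#   chroma run --path ./chroma
SERVER_HOST = "localhost"  # assumption: the server runs on this machine
SERVER_PORT = 8000         # assumption: the server uses port 8000


def open_shared_collection(name="test", host=SERVER_HOST, port=SERVER_PORT):
    """Return a collection handle backed by the single Chroma server process."""
    import chromadb  # imported lazily so this sketch loads even without chromadb

    client = chromadb.HttpClient(host=host, port=port)
    return client.get_or_create_collection(name)


# In both the notebook and _t2.py, the PersistentClient lines would become:
#   t = open_shared_collection("test")
#   t.upsert(ids=["id_1"], documents=["doc_1"])
```

With this setup only the server process touches the persistent directory, and the notebook and the script both talk to it over HTTP.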

@jzclever
Author

jzclever commented Feb 12, 2025

Thanks @tazarov, launching a server and then connecting via HttpClient from both the notebook and the script does in fact appear to resolve the issue—allowing the notebook to see live updates from writes performed by the script.

Insofar as the behavior of the PersistentClient, thanks for pointing out that Chroma is not process-safe (I was not aware). However, I am not clear on your explanation, specifically "You cannot expect to hit the same persistent dir from two different processes".

I only have a single directory that I am connecting to from both the script and the notebook (./chroma), so instantiating a PersistentClient from both must point to the same location.

This is corroborated by the fact that script-based writes to the persistent dir do result (most of the time) in the notebook being able to see the newly added ids and documents. It is just the embeddings that seem to get stuck in some intermediary space that is not accessible to the notebook until kernel restart.
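A small check along these lines can make the mismatch explicit. It is only a sketch: `embeddings_missing` is a hypothetical helper, and it assumes the result has the same shape as the t.get(include=['documents','embeddings']) output shown above (a flat "ids" list plus an "embeddings" sequence that may be empty or None).

```python
def embeddings_missing(result):
    """Return how many ids in a get()-style result have no embedding."""
    ids = result.get("ids") or []
    embeddings = result.get("embeddings")
    # Use len() rather than truthiness so array-like values also work.
    n_embeddings = 0 if embeddings is None else len(embeddings)
    return len(ids) - n_embeddings


# Shaped like the stale notebook output above: 5 ids, 0 embeddings.
stale = {"ids": ["id_1", "id_2", "id_3", "id_4", "id_5"], "embeddings": []}
print(embeddings_missing(stale))
```

In the notebook this would be called as embeddings_missing(t.get(include=['embeddings'])); a non-zero value means the in-process view is out of date.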

As for why it works in Colab: I guess the suggestion is that it's because a Colab session is really all happening within a single event loop that is running in the background?

@tazarov
Contributor

tazarov commented Feb 16, 2025

> Insofar as the behavior of the PersistentClient, thanks for pointing out that Chroma is not process-safe (I was not aware). However, I am not clear on your explanation, specifically "You cannot expect to hit the same persistent dir from two different processes".

Sorry if I caused confusion. What I meant is that users should not expect consistent results if they attempt to access the same persistent dir from two Chroma instances.

Here's a diagram to illustrate what each process (assuming both start at the same time) will see:

[Image: diagram of the view each process has of the data]
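The same idea can be sketched as a toy model. This is not Chroma's implementation: `ToyStore` is hypothetical, and the split between one file that is re-read on every call and one that is loaded into memory once at start-up is an assumption made purely to reproduce the symptom reported above (ids visible, embeddings missing until restart). Two objects in one process stand in for the two processes.

```python
import json
import tempfile
from pathlib import Path


class ToyStore:
    """Toy store: documents are read from disk live, vectors are cached in memory."""

    def __init__(self, directory):
        self.docs_file = Path(directory) / "docs.json"
        self.vectors_file = Path(directory) / "vectors.json"
        # Vectors are loaded once at start-up and never re-read from disk.
        self._vectors = self._load(self.vectors_file)

    @staticmethod
    def _load(path):
        return json.loads(path.read_text()) if path.exists() else {}

    def upsert(self, key, doc, vector):
        docs = self._load(self.docs_file)
        docs[key] = doc
        self.docs_file.write_text(json.dumps(docs))
        self._vectors[key] = vector
        self.vectors_file.write_text(json.dumps(self._vectors))

    def get(self):
        docs = self._load(self.docs_file)  # always fresh from disk
        vectors = [self._vectors[k] for k in docs if k in self._vectors]  # cached
        return {"ids": list(docs), "embeddings": vectors}


with tempfile.TemporaryDirectory() as shared_dir:
    notebook = ToyStore(shared_dir)  # plays the long-running notebook kernel
    script = ToyStore(shared_dir)    # plays _t2.py
    script.upsert("id_1", "doc_1", [0.1, 0.2])

    stale = notebook.get()                  # sees the id, not the vector
    restarted = ToyStore(shared_dir).get()  # a "kernel restart" reloads everything

print(stale)      # {'ids': ['id_1'], 'embeddings': []}
print(restarted)  # {'ids': ['id_1'], 'embeddings': [[0.1, 0.2]]}
```

In this model the instance that was already running keeps serving its start-up copy of the vectors, while a fresh instance sees everything.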

> As for why it works in Colab: I guess the suggestion is that it's because a Colab session is really all happening within a single event loop that is running in the background?

Colab runs in a single process. Within one process, accessing the same persistent dir gives a consistent view of Chroma data even if you create a new PersistentClient: Chroma has a client caching mechanism for persistent clients, which ensures that all newly created persistent clients point to the same view of the data.
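As a rough illustration of that caching idea (again a toy, not Chroma's actual code; `ToyClient` and `_shared_state` are hypothetical names), a per-process cache keyed by path makes every client for the same directory share one state object:

```python
_shared_state = {}  # hypothetical per-process cache: path -> shared backing state


class ToyClient:
    """Toy client: all clients for one path in this process share one state."""

    def __init__(self, path):
        # Reuse the existing state for this path, or create it on first use.
        self._state = _shared_state.setdefault(path, {"embeddings": {}})

    def upsert(self, key, vector):
        self._state["embeddings"][key] = vector

    def get(self):
        return dict(self._state["embeddings"])


first = ToyClient("./chroma")
second = ToyClient("./chroma")  # a "new" client, same process, same path
first.upsert("id_1", [0.1, 0.2])
print(second.get())  # {'id_1': [0.1, 0.2]}
```

A cache like this lives in one process's memory, so a second process (such as a separately launched script) gets its own independent copy, which matches the difference between the Colab run and the VS Code run described above.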
