[Bug]: ids and docs added to db, but embeddings only appear after restart of Jupyter kernel #3769

jzclever · 2025-02-12T01:45:40Z

What happened?

Running with Python 3.13.0 on an M3 MacBook

MacBook M3
macOS 15.3.1
Python 3.13.0
VS Code 1.97.0
VS Code Jupyter Extension 2025.1.0
Chroma 0.6.3

Execute cell 1 of notebook _t1.ipynb running in VS Code:

import chromadb, os 
op = os.path 
curr_dir = op.abspath(os.getcwd())
chroma_path = op.join(curr_dir, 'chroma')
client = chromadb.PersistentClient(chroma_path)

From a separate terminal, execute the script _t2.py

import chromadb, os

op = os.path 
curr_dir = op.abspath(op.dirname(__file__))
chroma_path = op.join(curr_dir, 'chroma')
client = chromadb.PersistentClient(chroma_path)

t = client.get_or_create_collection('test')

for n in range(1,6):
    new_id = f'id_{n}'
    new_doc = f'doc_{n}'
    t.upsert(ids=[new_id], documents=[new_doc])

result = t.get(include=['documents','embeddings'])

print(f'{result["ids"]                = }')
print(f'{result["documents"]          = }')
print(f'{len(result["embeddings"])    = }')
print(f'{len(result["embeddings"][0]) = }')

Confirm the expected output in the console:

result["ids"]                = ['id_1', 'id_2', 'id_3', 'id_4', 'id_5']
result["documents"]          = ['doc_1', 'doc_2', 'doc_3', 'doc_4', 'doc_5']
len(result["embeddings"])    = 5
len(result["embeddings"][0]) = 384

Run cell 2 of _t1.ipynb:

t = client.get_or_create_collection('test')

result = t.get(include=['documents','embeddings'])

print(f'{result["ids"]                = }')
print(f'{result["documents"]          = }')
print(f'{len(result["embeddings"])    = }')
# print(f'{len(result["embeddings"][0]) = }')

result = t.query(
    query_texts=['doc_1'], 
    include=['documents','embeddings'],
    n_results=5
)

print(f"\n {'<>' * 10 } \n")
print(f'{result["ids"]                = }')
print(f'{result["documents"]          = }')
print(f'{len(result["embeddings"])    = }')
print(f'{len(result["embeddings"][0]) = }')
print(f'{result["embeddings"]         = }')

Confirm that the ids and documents are there, but embeddings are not

result["ids"]                = ['id_1', 'id_2', 'id_3', 'id_4', 'id_5']
result["documents"]          = ['doc_1', 'doc_2', 'doc_3', 'doc_4', 'doc_5']
len(result["embeddings"])    = 0

 <><><><><><><><><><> 

result["ids"]                = [[]]
result["documents"]          = [[]]
len(result["embeddings"])    = 1
len(result["embeddings"][0]) = 0
result["embeddings"]         = [array([], dtype=float64)]

Not only is this buggy, but it is also inconsistent.
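The mismatch shown above (ids and documents present, embeddings absent) can be checked for programmatically. The following is a minimal sketch, assuming only that the result of `t.get(include=['documents','embeddings'])` behaves like a dict with `ids` and `embeddings` keys, as the printed output suggests. The helper name `missing_embeddings` is hypothetical and not part of Chroma.

```python
def missing_embeddings(result):
    """Return how many ids in a get() result came back without an embedding.

    Hypothetical helper: it only compares lengths, so it cannot say
    which ids are affected.
    """
    ids = result["ids"]
    embeddings = result.get("embeddings")  # None if embeddings were not requested
    found = 0 if embeddings is None else len(embeddings)
    return max(len(ids) - found, 0)


# Shape of the stale result reported above: five ids, zero embeddings.
stale = {"ids": ["id_1", "id_2", "id_3", "id_4", "id_5"], "embeddings": []}
print(missing_embeddings(stale))  # prints 5
```

A non-zero return value in the notebook, right after the script has written, would indicate the state described in this report.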

On a separate trial, I managed to (somehow) get one embedding properly stored, even though t.get() was showing 5 total documents (so 4 were still missing).

The really baffling part came with t.query() warning that I only had 4 existing elements, despite the request for n_results=5.

result["ids"]                = ['id_1', 'id_2', 'id_3', 'id_4', 'id_5']
result["documents"]          = ['doc_1', 'doc_2', 'doc_3', 'doc_4', 'doc_5']
len(result["embeddings"])    = 1
len(result["embeddings"][0]) = 384

 <><><><><><><><><><> 

Number of requested results 5 is greater than number of elements in index 4, updating n_results = 4
result["ids"]                = [['id_1', 'id_2', 'id_3', 'id_4']]
result["documents"]          = [['doc_1', 'doc_2', 'doc_3', 'doc_4']]
len(result["embeddings"])    = 1
len(result["embeddings"][0]) = 1

This behavior occurs for image-based collections as well.

The behavior also occurs with collection.add, so it is not specific to collection.upsert (which my example uses).

The behavior also occurs if I create the collection for the first time in the notebook cell rather than in the script (as my current example does).

If I restart the kernel and run the notebook cells again, I get the expected output:

result["ids"]                = ['id_1', 'id_2', 'id_3', 'id_4', 'id_5']
result["documents"]          = ['doc_1', 'doc_2', 'doc_3', 'doc_4', 'doc_5']
len(result["embeddings"])    = 5

 <><><><><><><><><><> 

result["ids"]                = [['id_1', 'id_2', 'id_3', 'id_4', 'id_5']]
result["documents"]          = [['doc_1', 'doc_2', 'doc_3', 'doc_4', 'doc_5']]
len(result["embeddings"])    = 1
len(result["embeddings"][0]) = 5

I have also confirmed that this issue does not exist in Google Colab.

I can run the first cell (to instantiate the persistent client), then write my _t2.py to the local Colab file system and call it with %run _t2.py, and then verify the existence of the newly inserted content (with embeddings) by running the second cell of the still-active notebook.

Versions

MacBook M3
macOS 15.3.1

Python 3.13.0
VS Code 1.97.0
VS Code Jupyter Extension 2025.1.0

Chroma 0.6.3

Relevant log output


tazarov · 2025-02-12T14:20:46Z

@jzclever, if I understand your problem correctly, you are trying to read/update data in Chroma from two different processes - a notebook and a Python script (e.g. python _t2.py). If that is the case, the problem you are hitting is that Chroma is not process-safe. You cannot expect to hit the same persistent dir from two different processes (a notebook and a separate Python script), which is also why running the whole thing from a single Google Colab works fine.

Just as an experiment try running Chroma as a server and use HttpClient instead of PersistentClient in both the notebook and the script to see the difference.
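A minimal sketch of that experiment follows. It assumes a Chroma server has already been started against the same directory (for example with `chroma run --path ./chroma`) and is listening on localhost:8000; the host and port are assumptions, so adjust them to your setup. The helper names `connect` and `upsert_docs` are hypothetical; only `chromadb.HttpClient`, `get_or_create_collection`, `upsert` and `get` come from Chroma.

```python
def connect(host="localhost", port=8000):
    """Open an HTTP client to a running Chroma server.

    chromadb is imported lazily so the helper below can be used on its own.
    host and port are assumptions; match them to how the server was started.
    """
    import chromadb

    return chromadb.HttpClient(host=host, port=port)


def upsert_docs(collection, count=5):
    """Write id_1..id_<count> / doc_1..doc_<count>, mirroring the loop in _t2.py."""
    written = []
    for n in range(1, count + 1):
        collection.upsert(ids=[f"id_{n}"], documents=[f"doc_{n}"])
        written.append(f"id_{n}")
    return written


# Script side (run once the server is up):
#   t = connect().get_or_create_collection("test")
#   upsert_docs(t)
#
# Notebook side (same server, no kernel restart):
#   t = connect().get_or_create_collection("test")
#   result = t.get(include=["documents", "embeddings"])
```

With this layout only the server process opens the persistent directory, and the notebook and the script both go through it.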

jzclever · 2025-02-12T17:09:46Z

Thanks @tazarov, launching a server and then connecting via HttpClient from both the notebook and the script does in fact appear to resolve the issue—allowing the notebook to see live updates from writes performed by the script.

As for the behavior of the PersistentClient, thanks for pointing out that Chroma is not process-safe (I was not aware). However, I am not clear on your explanation, specifically "You cannot expect to hit the same persistent dir from two different processes".

I only have a single directory that I connect to from both the script and the notebook (./chroma). So instantiating a PersistentClient from both notebook and script must point to the same location.

This is corroborated by the fact that script-based writes to the persistent dir do result (most of the time) in the notebook being able to see the newly added ids and documents. It is just the embeddings that seem to get stuck in some intermediary space that is not accessible to the notebook until kernel restart.

As for why it works in Colab: I guess the suggestion is that it's because a Colab session is really all happening within a single event loop that is running in the background?

tazarov · 2025-02-16T08:50:41Z

> As for the behavior of the PersistentClient, thanks for pointing out that Chroma is not process-safe (I was not aware). However, I am not clear on your explanation, specifically "You cannot expect to hit the same persistent dir from two different processes".

Sorry if I caused confusion. What I meant is that users should not expect consistent results if they attempt to access the same persistent dir from two Chroma instances.
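To make the idea of two instances disagreeing concrete, here is a toy model. It is an analogy only, not Chroma's storage code: it assumes, purely for illustration, that each instance reads ids and documents from a shared on-disk file on every call but snapshots the vectors into memory once at start-up. `ToyStore` and everything in it are hypothetical, and both "processes" are simulated as two objects in one script.

```python
import os
import sqlite3
import tempfile


class ToyStore:
    """Toy stand-in for one client instance (not Chroma's implementation)."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, doc TEXT, vec TEXT)"
        )
        self.db.commit()
        # Assumption under illustration: vectors are loaded once, at start-up.
        self.vec_cache = dict(self.db.execute("SELECT id, vec FROM docs").fetchall())

    def upsert(self, id_, doc, vec):
        self.db.execute("INSERT OR REPLACE INTO docs VALUES (?, ?, ?)", (id_, doc, vec))
        self.db.commit()
        self.vec_cache[id_] = vec  # only the writing instance updates its cache

    def get(self):
        rows = self.db.execute("SELECT id, doc FROM docs ORDER BY id").fetchall()
        ids = [r[0] for r in rows]
        return {
            "ids": ids,
            "documents": [r[1] for r in rows],
            "embeddings": [self.vec_cache[i] for i in ids if i in self.vec_cache],
        }


path = os.path.join(tempfile.mkdtemp(), "toy.sqlite3")
notebook = ToyStore(path)  # plays the long-lived notebook kernel
script = ToyStore(path)    # plays the separate script process
for n in range(1, 6):
    script.upsert(f"id_{n}", f"doc_{n}", "[0.0]")

stale = notebook.get()        # the instance that was opened before the writes
fresh = ToyStore(path).get()  # plays the restarted kernel
print(len(stale["ids"]), len(stale["embeddings"]))  # prints: 5 0
print(len(fresh["ids"]), len(fresh["embeddings"]))  # prints: 5 5
```

In this toy the older instance sees all five ids and documents but no vectors until it is re-created, which resembles the symptom in the report. Whether Chroma behaves this way for the same reason is not established here.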

Here's a diagram to illustrate what each process (assuming both start at the same time) will see:

> As for why it works in Colab: I guess the suggestion is that it's because a Colab session is really all happening within a single event loop that is running in the background?

Colab runs in a single process. Within one process, accessing the same persistent dir gives a consistent view of Chroma data even if you create a new PersistentClient: Chroma has a client caching mechanism for persistent clients, which ensures that all newly created persistent clients point to the same view of the data.
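The general idea of a per-process client cache can be sketched as follows. This is a hypothetical illustration of the pattern, not Chroma's actual caching code: `ToyClient` and the `_shared_state` registry are invented names, and the registry lives in one Python process only, so a second process would get its own, separate copy.

```python
import os

# Hypothetical per-process registry: absolute path -> shared state.
_shared_state = {}


class ToyClient:
    """Toy client: every instance opened on the same path shares one dict."""

    def __init__(self, path):
        key = os.path.abspath(path)
        self.state = _shared_state.setdefault(key, {})

    def upsert(self, id_, doc):
        self.state[id_] = doc

    def get(self):
        return dict(self.state)


a = ToyClient("./chroma")
b = ToyClient("chroma")  # resolves to the same absolute path as "./chroma"
a.upsert("id_1", "doc_1")
print(b.get())  # prints {'id_1': 'doc_1'}
```

Because the registry is ordinary in-memory state, it cannot be shared across a notebook kernel and a separately launched script, which fits the single-process explanation for Colab given above.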

jzclever added the bug label Feb 12, 2025

tazarov self-assigned this Feb 12, 2025

amd-tibbetso mentioned this issue Feb 13, 2025

[Bug]: Query does not match Get on persistent DB accessed by two processes. #3792

Open
