I have been reviewing the documentation and would like to confirm whether encoding text using Tokenizer.from_file(str(tokenizer_path)).encode(text) is thread-safe.
Below is a simplified version of my implementation for context:
    @final
    class HuggingFaceTokenizer:
        _tokenizer = None

        def __init__(self):
            tokenizer_path = Path("foobar", "tokenizers", "tokenizer.json")
            self._tokenizer: Tokenizer = Tokenizer.from_file(str(tokenizer_path))

        @override
        def tokens_count(self, text: str | None) -> int:
            """Count the number of tokens in the provided text."""
            return 0 if not text else len(self._tokenizer.encode(text))
My concern is whether the _tokenizer.encode() method is inherently safe to use across multiple threads without additional synchronization mechanisms (e.g., locks).
If thread safety is not guaranteed, I am considering implementing a thread-safe mechanism as follows:
    @final
    class HuggingFaceTokenizer:
        _tokenizer: Tokenizer | None = None
        _lock = threading.Lock()

        def __init__(self):
            tokenizer_path = Path("foobar", "tokenizers", "tokenizer.json")
            with self._lock:
                if not self._tokenizer:
                    self._tokenizer = Tokenizer.from_file(str(tokenizer_path))

        @override
        def tokens_count(self, text: str | None) -> int:
            """Count the number of tokens in the provided text."""
            if not text:
                return 0
            with self._lock:
                return len(self._tokenizer.encode(text))
Could you please clarify:
Is the encode method inherently thread-safe?
If not, is the approach above (using a lock) sufficient to ensure thread safety?
Are there any recommended best practices for safely using Tokenizer in multithreaded environments?
Thank you in advance!