Thread safe? #1726

Open
drupol opened this issue Jan 21, 2025 · 0 comments
drupol commented Jan 21, 2025

Hello,

I have been reviewing the documentation and would like to confirm whether encoding text using Tokenizer.from_file(str(tokenizer_path)).encode(text) is thread-safe.

Below is a simplified version of my implementation for context:

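from pathlib import Path
from typing import final, override  # override requires Python 3.12+

from tokenizers import Tokenizer


@final
class HuggingFaceTokenizer:  # base class omitted in this simplified version
    def __init__(self):
        tokenizer_path = Path("foobar", "tokenizers", "tokenizer.json")
        # Each instance loads its own tokenizer from disk.
        self._tokenizer: Tokenizer = Tokenizer.from_file(str(tokenizer_path))

    @override
    def tokens_count(self, text: str | None) -> int:
        """Count the number of tokens in the provided text."""
        return 0 if not text else len(self._tokenizer.encode(text).ids)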

My concern is whether the _tokenizer.encode() method is inherently safe to use across multiple threads without additional synchronization mechanisms (e.g., locks).

If thread safety is not guaranteed, I am considering implementing a thread-safe mechanism as follows:

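import threading
from pathlib import Path
from typing import final, override  # override requires Python 3.12+

from tokenizers import Tokenizer


@final
class HuggingFaceTokenizer:  # base class omitted in this simplified version
    # Shared across all instances: one tokenizer, guarded by one lock.
    _tokenizer: Tokenizer | None = None
    _lock = threading.Lock()

    def __init__(self):
        tokenizer_path = Path("foobar", "tokenizers", "tokenizer.json")
        cls = type(self)
        with cls._lock:
            # Assign on the class: assigning to self would create a
            # per-instance attribute, so every new instance would reload.
            if cls._tokenizer is None:
                cls._tokenizer = Tokenizer.from_file(str(tokenizer_path))

    @override
    def tokens_count(self, text: str | None) -> int:
        """Count the number of tokens in the provided text."""
        if not text:
            return 0

        # Serialize encode() calls so only one thread encodes at a time.
        with self._lock:
            return len(self._tokenizer.encode(text).ids)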

Could you please clarify:

  1. Is the encode method inherently thread-safe?
  2. If not, is the approach above (using a lock) sufficient to ensure thread safety?
  3. Are there any recommended best practices for safely using Tokenizer in multithreaded environments?
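
To make the question concrete, here is a minimal sketch of how either wrapper could be smoke-tested under concurrency: it compares token counts computed sequentially with counts computed from a thread pool. The names fake_tokens_count and concurrent_counts_match are hypothetical, and the whitespace-based counter is only a stand-in so the snippet runs without a tokenizer file. A passing check would not prove thread safety; it would only fail to find a problem.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def fake_tokens_count(text: Optional[str]) -> int:
    """Hypothetical stand-in for HuggingFaceTokenizer.tokens_count."""
    return 0 if not text else len(text.split())


def concurrent_counts_match(
    count_fn: Callable[[Optional[str]], int],
    texts: list[str],
    workers: int = 8,
) -> bool:
    """Return True if threaded results equal the sequential baseline."""
    # Baseline: call the counter from a single thread.
    expected = [count_fn(text) for text in texts]
    # Same inputs, same order, but dispatched across worker threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        actual = list(pool.map(count_fn, texts))
    return actual == expected


if __name__ == "__main__":
    sample = ["hello world", "", "a b c d"] * 100
    print(concurrent_counts_match(fake_tokens_count, sample))
```

With the real class, HuggingFaceTokenizer().tokens_count would be passed in place of fake_tokens_count.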

Thank you in advance!
