
ERROR occurs when running "tokenizer._tokenizer.model.clear_cache()" #1738

nixonjin opened this issue Feb 21, 2025 · 1 comment

nixonjin commented Feb 21, 2025

I ran into an OOM problem when using BertTokenizer, as reported in #1539.

I then tried tokenizer._tokenizer.model.clear_cache() (and tokenizer._tokenizer.model._clear_cache()) to clear the cache.

However, I got an error: AttributeError: 'tokenizers.models.WordPiece' object has no attribute 'clear_cache'. Could anyone tell me how to fix it?

In the source code, it seems that clear_cache is only implemented for the BPE and Unigram models, not for WordPiece. Is that the reason? If so, could anyone give me some advice on how to work around this?
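
For reference, a minimal sketch of the call I'm making, with a guard so the AttributeError is avoided when the backing model doesn't expose clear_cache(). The checkpoint name is just an example:

```python
# Minimal sketch (not a fix for the underlying OOM): only call clear_cache()
# when the backing tokenizers model actually exposes it.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint
model = tokenizer._tokenizer.model  # tokenizers.models.WordPiece for BERT

if hasattr(model, "clear_cache"):
    model.clear_cache()  # implemented for BPE and Unigram models
else:
    print(f"{type(model).__name__} has no clear_cache(); nothing to clear")
```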

Environment:
  • Linux, CPU only
  • tokenizers==0.21.0
  • transformers==4.49.0

MeetThePatel commented

Would you be able to provide more context regarding the block of code that is OOMing?

For BPE:

  • During training: you are just performing merges in a deterministic fashion (source).
  • During tokenization: you are applying your learned merge rules, and the results can be saved to a cache for tokens you have already "built" (source); see the sketch below.
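
To make the caching point concrete, here is an illustrative sketch (not the library's actual implementation, just the idea) of a word-level cache like the one clear_cache() resets; each distinct word adds an entry, so it can grow large over a diverse corpus:

```python
# Illustrative sketch (not the library's code): a BPE word cache that grows with
# every distinct word tokenized; this is the kind of structure clear_cache() resets.
cache = {}

def bpe_tokenize(word, merges):
    if word in cache:                 # reuse tokens already "built" for this word
        return cache[word]
    pieces = list(word)
    changed = True
    while changed:                    # keep applying the learned merge rules
        changed = False
        for a, b in merges:
            i = 0
            while i < len(pieces) - 1:
                if pieces[i] == a and pieces[i + 1] == b:
                    pieces[i:i + 2] = [a + b]
                    changed = True
                else:
                    i += 1
    cache[word] = pieces              # unbounded growth across many distinct words
    return pieces

print(bpe_tokenize("low", [("l", "o"), ("lo", "w")]))  # ['low']
```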

For WordPiece:

  • During training: you are doing BPE, which doesn't use a cache (source).
  • During tokenization: you don't need a cache, as you are just matching greedily, searching for the longest substring that is in the vocab (source); see the sketch below.
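
And a simplified sketch of that greedy longest-match scheme (again illustrative only, not the library's implementation); each word is resolved by vocabulary lookups alone, so there is nothing to cache:

```python
# Illustrative sketch: WordPiece-style greedy longest-match tokenization of a
# single word. Every piece is found by direct vocab lookups, so there is no
# per-token cache to clear.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Greedily look for the longest substring ("##" prefix for non-initial pieces).
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:             # no piece matches: the whole word becomes [UNK]
            return [unk]
        tokens.append(match)
        start = end
    return tokens

print(wordpiece_tokenize("unaffable", {"un", "##aff", "##able"}))
# ['un', '##aff', '##able']
```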

This leads me to believe that either:

  1. There is a problem with the surrounding code.
  2. The vocab you are trying to load is too large for your machine. This seems less likely, since BertTokenizer's vocab is only about 30k entries.
