I ran into an OOM problem when using BertTokenizer, as reported in #1539.
I then tried tokenizer._tokenizer.model.clear_cache() and tokenizer._tokenizer.model._clear_cache() to clear the cache.
However, both calls fail with: AttributeError: 'tokenizers.models.WordPiece' object has no attribute 'clear_cache'. Could anyone tell me how to fix this?
Looking at the source code, it seems that clear_cache is only implemented for the BPE and Unigram models, not for WordPiece. Is that the reason? If so, could anyone give me some advice on how to work around this?
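A minimal reproduction sketch (bert-base-uncased is only an example checkpoint; any WordPiece-based tokenizer behaves the same way):

```python
from transformers import BertTokenizerFast

# Example checkpoint; any WordPiece-based fast tokenizer should reproduce this.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

print(type(tokenizer._tokenizer.model))  # <class 'tokenizers.models.WordPiece'>

# Raises AttributeError: 'tokenizers.models.WordPiece' object has no attribute 'clear_cache'
tokenizer._tokenizer.model.clear_cache()
```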
Environment:
Linux, CPU only
tokenizers==0.21.0
transformers==4.49.0
Would you be able to provide more context regarding the block of code that is OOMing?
For BPE:
During training: you are just performing merges in a deterministic fashion. source
During tokenization: you are applying your learned merge rules, and the results can be saved to a cache for tokens you have already "built". source
For WordPiece:
During training: you are doing BPE, which doesn't use a cache. source
During tokenization: you don't need a cache, as you are just doing greedy matching, searching for the largest substring that is in the vocab (a simplified sketch of this matching is below). source
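For reference, the greedy matching is roughly the following. This is a simplified illustrative Python sketch only; the real implementation is the Rust code linked above, and the function and variable names here are made up:

```python
# Rough sketch of WordPiece-style greedy longest-match tokenization of a single word.
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", max_chars=100):
    if len(word) > max_chars:
        return [unk_token]
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        cur = None
        # Find the longest substring starting at `start` that is in the vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the "##" prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return [unk_token]  # no piece matched, the whole word is unknown
        tokens.append(cur)
        start = end
    return tokens

print(wordpiece_tokenize("unhappily", {"un", "##happi", "##ly"}))
# ['un', '##happi', '##ly']
```

Since each word is resolved by this direct lookup against the vocab, there is no per-token cache to clear, which is why WordPiece has no clear_cache.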
Which leads me to believe that either:
There is a problem with the surrounding code.
The vocab you are trying to load is too large for your machine. This seems less likely, since BertTokenizer's vocab only has ~30k entries (see the quick check below).
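A quick way to check both points on your side (bert-base-uncased is just a placeholder for whatever checkpoint you are actually loading):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint

print(len(tok))                                      # vocab size; ~30k for BERT
print(type(tok._tokenizer.model).__name__)           # WordPiece
# clear_cache only exists on the models that actually keep a cache (BPE, Unigram),
# so a WordPiece model simply doesn't have the attribute:
print(hasattr(tok._tokenizer.model, "clear_cache"))  # False
```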