Speculative decoding: Base implementation #1938
Conversation
Basically, it's ready for a review. Tests are failing because there is an error with loading the pythia-14m model (used for tests) and all tokenizers; something is wrong on the HF side. Will rerun the tests tomorrow. As for numbers, ideally I wanted to test the Llama 1B and 8B models, but I don't have access to the 8B model repo 😞.
Hi Andrei, thank you. I'll take a look.
So the HF key is working again (thank you @Borda), but there are two failures left.
Hey there 👋
This PR includes a base implementation of speculative decoding, the one proposed in Fast Inference from Transformers via Speculative Decoding.
The focus was on getting a working solution rather than on optimizations.
For instance, it doesn't support batched inference; that should be added in a follow-up PR/PRs.
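For reviewers, here is a minimal sketch of the core draft-then-verify loop from the paper. This is not the PR's actual code: the `speculative_decode` name, its signature, and the assumption that a model maps a 1-D tensor of token ids to per-position logits are all simplifications for illustration (no batching, no KV cache).

```python
import torch

@torch.no_grad()
def speculative_decode(draft_model, target_model, input_ids, num_draft_tokens=4):
    """One round of draft-then-verify, after Leviathan et al. (2023).

    Assumes both models share a vocabulary and map a 1-D LongTensor of
    token ids to logits of shape (seq_len, vocab_size).
    """
    # 1) The small draft model proposes tokens autoregressively.
    draft_ids = input_ids.clone()
    draft_probs = []
    for _ in range(num_draft_tokens):
        probs = torch.softmax(draft_model(draft_ids)[-1], dim=-1)
        draft_probs.append(probs)
        draft_ids = torch.cat([draft_ids, torch.multinomial(probs, num_samples=1)])

    # 2) The large target model scores prompt + draft in ONE forward pass.
    target_probs = torch.softmax(target_model(draft_ids), dim=-1)

    # 3) Accept each draft token with probability min(1, p_target / p_draft).
    accepted = []
    n = input_ids.shape[0]
    for i in range(num_draft_tokens):
        token = draft_ids[n + i]
        p_t = target_probs[n + i - 1, token]   # target prob of the draft token
        p_d = draft_probs[i][token]            # draft prob it was sampled with
        if torch.rand(()) < torch.clamp(p_t / p_d, max=1.0):
            accepted.append(token)
        else:
            # On rejection, resample from the residual max(0, p - q),
            # renormalized; this keeps the output distribution exactly
            # that of the target model.
            residual = torch.clamp(target_probs[n + i - 1] - draft_probs[i], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).squeeze())
            break
    else:
        # Every draft token accepted: take a free bonus token from the target.
        accepted.append(torch.multinomial(target_probs[-1], 1).squeeze())

    return torch.cat([input_ids, torch.stack(accepted)])
```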
TODO:
Find a suitable model pair for benchmarking: I don't have access to all Llama models, and other model families have different vocab sizes across model sizes.
For the method to work properly, we need:
a) a significant size difference between the draft and target models (> 10x)
b) the same vocabulary for both models
c) a target model that accepts the draft tokens most of the time
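As a smoke test of the sketch above, here is a toy pairing that mimics conditions (a) and (b) with two untrained stand-in models. Everything here is hypothetical and not the PR's test setup; a real pair would be two checkpoints from the same family sharing one tokenizer.

```python
import torch

VOCAB = 100  # toy vocabulary size, shared by both models (condition b)

class ToyLM(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, dim)
        self.head = torch.nn.Linear(dim, VOCAB)

    def forward(self, ids):              # (seq_len,) -> (seq_len, VOCAB)
        return self.head(self.emb(ids))

draft, target = ToyLM(8), ToyLM(128)    # size gap stands in for condition (a)
prompt = torch.tensor([1, 2, 3])
print(speculative_decode(draft, target, prompt, num_draft_tokens=4))
```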