Speculative decoding: Base implementation #1938

Open · wants to merge 28 commits into base: main
Conversation

Andrei-Aksionov (Contributor) commented Feb 16, 2025

Hey there 👋

This PR includes a base implementation of speculative decoding, as proposed in Fast Inference from Transformers via Speculative Decoding.

The focus was on a working implementation rather than on optimizations; for instance, batched inference is not yet supported. It should be added in a follow-up PR (or PRs).


TODO:

  • 1. Add tests
  • 2. Check on a GPU that all tensors are placed correctly
  • [❌] 3. Provide initial tokens/second numbers, to serve as a baseline for future optimizations.
    I don't have access to all Llama models, and other model families use different vocab sizes across model sizes.
    For the method to work properly, we need:
    a) a significant size difference between the draft and target models (> 10x)
    b) the same vocabulary for both models
    c) a target model that accepts draft tokens most of the time
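For context, the core accept/reject step of speculative decoding (from the cited paper) can be sketched as below. This is a minimal illustration of the sampling rule only; all names, shapes, and helpers are assumptions, not this PR's actual API.

```python
# Minimal sketch of the accept/reject step from "Fast Inference from
# Transformers via Speculative Decoding". Illustrative only: names and
# shapes are assumptions, not this PR's code.
import torch


def speculative_step(
    draft_probs: torch.Tensor,   # (k, vocab) draft-model probabilities
    target_probs: torch.Tensor,  # (k + 1, vocab) target-model probabilities
    draft_tokens: torch.Tensor,  # (k,) tokens sampled from the draft model
) -> torch.Tensor:
    """Return the accepted prefix of draft tokens plus one target-sampled token."""
    accepted: list[int] = []
    for i, tok in enumerate(draft_tokens.tolist()):
        p = target_probs[i, tok]  # target probability of the draft token
        q = draft_probs[i, tok]   # draft probability of the same token
        # Accept the draft token with probability min(1, p / q).
        if torch.rand(()) < (p / q).clamp(max=1.0):
            accepted.append(tok)
        else:
            # Rejected: resample from the residual distribution max(0, p - q),
            # which keeps the overall output distribution equal to the target's.
            residual = (target_probs[i] - draft_probs[i]).clamp(min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            return torch.tensor(accepted)
    # All k draft tokens were accepted: sample one "bonus" token from the target.
    accepted.append(torch.multinomial(target_probs[-1], 1).item())
    return torch.tensor(accepted)
```

When the draft and target distributions are close, most draft tokens are accepted and each target forward pass yields several tokens, which is exactly why requirement (c) above matters for the speedup.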

Andrei-Aksionov (Contributor, Author) commented:
Basically, it's ready for a review.

Tests are failing because there is an error loading the pythia-14m model (used in tests) and all tokenizers; something is wrong on the HF side. I will rerun the tests tomorrow.

As for the numbers, ideally I wanted to test the Llama 1B and 8B models, but I don't have access to the 8B model repo 😞.
Qwen comes in lots of different sizes, but its vocabularies differ between sizes, so that doesn't fit the bill.

@Andrei-Aksionov marked this pull request as ready for review February 17, 2025 19:24
Andrei-Aksionov (Contributor, Author) commented:

Oddly enough, these tests pass locally, in a Studio, and in a fork.
I know that the HF token for this repo was initially provided by Carlos; then Sebastian used his own.
Now @lantiga or @t-vi, your turn :)

t-vi (Collaborator) commented Feb 18, 2025

Hi Andrei, thank you. I'll take a look.

t-vi (Collaborator) commented Feb 18, 2025

So the HF key is working again (thank you @Borda ), but there are two failures left.

@Andrei-Aksionov mentioned this pull request Feb 19, 2025
Borda (Member) commented Feb 27, 2025

> So the HF key is working again (thank you @Borda ), but there are two failures left.

looking into it and will debug it in #1940
