Speculative decoding: Base implementation #1938
Conversation
Basically, it's ready for a review. Tests are failing because there is an error with loading the pythia-14m model (used for tests) and all tokenizers; something is wrong on the HF side. Will rerun the tests tomorrow. As for numbers, ideally I wanted to test the Llama 1B and 8B models, but I don't have access to the 8B model repo 😞.
Hi Andrei, thank you. I'll take a look.
So the HF key is working again (thank you @Borda), but there are two failures left.
Hey there 👋
This PR includes a base implementation of speculative decoding, the one proposed in Fast Inference from Transformers via Speculative Decoding.
The focus was on getting a working solution rather than on optimizations.
For instance, it doesn't support batched inference; that should be added in a follow-up PR/PRs.
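For reviewers, here is a minimal sketch of the core draft-then-verify loop from the paper. This is not the PR's actual code: the `speculative_decode` name, its signature, and the assumption that a model maps a 1-D tensor of token ids to per-position logits are all simplifications for illustration (no batching, no KV cache).

```python
import torch

@torch.no_grad()
def speculative_decode(draft_model, target_model, input_ids, num_draft_tokens=4):
    """One round of draft-then-verify, after Leviathan et al. (2023).

    Assumes both models share a vocabulary and map a 1-D LongTensor of
    token ids to logits of shape (seq_len, vocab_size).
    """
    # 1) The small draft model proposes tokens autoregressively.
    draft_ids = input_ids.clone()
    draft_probs = []
    for _ in range(num_draft_tokens):
        probs = torch.softmax(draft_model(draft_ids)[-1], dim=-1)
        draft_probs.append(probs)
        draft_ids = torch.cat([draft_ids, torch.multinomial(probs, num_samples=1)])

    # 2) The large target model scores prompt + draft in ONE forward pass.
    target_probs = torch.softmax(target_model(draft_ids), dim=-1)

    # 3) Accept each draft token with probability min(1, p_target / p_draft).
    accepted = []
    n = input_ids.shape[0]
    for i in range(num_draft_tokens):
        token = draft_ids[n + i]
        p_t = target_probs[n + i - 1, token]   # target prob of the draft token
        p_d = draft_probs[i][token]            # draft prob it was sampled with
        if torch.rand(()) < torch.clamp(p_t / p_d, max=1.0):
            accepted.append(token)
        else:
            # On rejection, resample from the residual max(0, p - q),
            # renormalized; this keeps the output distribution exactly
            # that of the target model.
            residual = torch.clamp(target_probs[n + i - 1] - draft_probs[i], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).squeeze())
            break
    else:
        # Every draft token accepted: take a free bonus token from the target.
        accepted.append(torch.multinomial(target_probs[-1], 1).squeeze())

    return torch.cat([input_ids, torch.stack(accepted)])
```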
TODO:
Find a suitable model pair for benchmarking: I don't have access to all Llama models, and other model families have different vocab sizes across model sizes.
For the method to work properly, we need:
a) a significant size difference between the draft and target models (> 10x)
b) the same vocabulary for both models
c) a target model that accepts the draft tokens most of the time
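As a smoke test of the sketch above, here is a toy pairing that mimics conditions (a) and (b) with two untrained stand-in models. Everything here is hypothetical and not the PR's test setup; a real pair would be two checkpoints from the same family sharing one tokenizer.

```python
import torch

VOCAB = 100  # toy vocabulary size, shared by both models (condition b)

class ToyLM(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, dim)
        self.head = torch.nn.Linear(dim, VOCAB)

    def forward(self, ids):              # (seq_len,) -> (seq_len, VOCAB)
        return self.head(self.emb(ids))

draft, target = ToyLM(8), ToyLM(128)    # size gap stands in for condition (a)
prompt = torch.tensor([1, 2, 3])
print(speculative_decode(draft, target, prompt, num_draft_tokens=4))
```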