Language & LLMs

What Is Speculative Decoding?

Speculative decoding accelerates text generation by using a small, fast model to propose several candidate tokens, which a larger model then verifies in parallel. Accepted tokens are kept, and rejected ones are recomputed by the large model. This reduces latency while producing the same output distribution as the large model alone.

Further reading

Read more about speculative decoding — articles and blogs from around the web: