Speed Leaderboard

About the Speed Tests

Every .gguf on this site is benchmarked with llama-bench from llama.cpp using identical parameters across every model and every GPU, so cross-GPU and cross-quant comparisons are apples-to-apples. Each model is run through five test variants at fourteen context depths from 0 to 131,072 tokens.

GPU Ranking panel

The Ranking tab compares GPUs on a fixed 12-model panel chosen to cover a range of sizes, families, and architectures (dense and MoE), up to ~12 GB so the set fits on every modern card. Each GPU's composite score is the geometric mean of its tokens/sec across those 12 models. A GPU appears in the ranking only if it has llama-bench data for every model in the panel. Token Generation drives the sort; Prefill is shown on a second axis.
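
As a rough illustration of the composite score, the sketch below computes a geometric mean over a panel and returns nothing for a GPU with incomplete coverage. The function name, the three-model panel, and the throughput numbers are all made up for the example; this is not the site's actual code.

```python
import math

def composite_score(tps_by_model, panel):
    """Geometric mean of tokens/sec over the panel, or None if any model is missing."""
    if any(model not in tps_by_model for model in panel):
        return None  # incomplete coverage: this GPU would not be ranked
    log_sum = sum(math.log(tps_by_model[model]) for model in panel)
    return math.exp(log_sum / len(panel))

# Hypothetical panel and invented throughput numbers, for illustration only.
panel = ["model-a", "model-b", "model-c"]
full = {"model-a": 100.0, "model-b": 50.0, "model-c": 25.0}
partial = {"model-a": 100.0, "model-b": 50.0}

print(composite_score(full, panel))     # about 50: the cube root of 100 * 50 * 25
print(composite_score(partial, panel))  # None
```

A geometric mean keeps one very fast small model from dominating the score, since each model contributes through its ratio rather than its absolute tokens/sec.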

Test variants

Display Name	Prefill	Generation	What it measures
Token Generation	—	128 tok	Pure steady-state output speed. Tokens per second the model emits once it has started generating. This is what you feel when waiting for a reply.
Prefill Speed	1,024 tok	—	Pure prompt-processing throughput. How fast the model can ingest your input before it produces the first token of output. Drives time-to-first-token for long inputs.
Agentic Small	256 tok	256 tok	Short prompt in, short reply out. Models a trivial tool reply, a simple chat turn, or a small classifier-style call.
Agentic Medium	1,024 tok	256 tok	A typical agent step: system prompt + tool output + scratchpad fit in roughly 1k tokens, the model writes a short reasoning step and a tool call.
Agentic Large	4,096 tok	256 tok	A heavy agent step with a long accumulated history of tool outputs. Still emitting a relatively short reply.
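
One plausible way to express these variants as llama-bench arguments is sketched below. It assumes the standalone tests map to -p (prompt tokens) and -n (generated tokens) and the combined tests to -pg (prompt,generation pairs); the site's real invocation is not shown here, so treat the mapping and the helper name as assumptions.

```python
# Variant name -> (prefill tokens, generation tokens), copied from the table above.
VARIANTS = {
    "Token Generation": (0, 128),
    "Prefill Speed": (1024, 0),
    "Agentic Small": (256, 256),
    "Agentic Medium": (1024, 256),
    "Agentic Large": (4096, 256),
}

def variant_args(name):
    """Return llama-bench test-selection arguments for one variant (hypothetical helper)."""
    prefill, generation = VARIANTS[name]
    if prefill and generation:
        # Zero out the standalone tests so only the combined prompt+generation test runs.
        return ["-p", "0", "-n", "0", "-pg", f"{prefill},{generation}"]
    # A value of 0 skips that standalone test, leaving only the other one.
    return ["-p", str(prefill), "-n", str(generation)]

for name in VARIANTS:
    print(name, variant_args(name))
```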

Context Size

The Context Size filter (and the X axis of the Context Scaling tab) is the amount of prior conversation already loaded in the KV cache before the variant runs. The sweep covers 0, 1k, 2k, 4k, 6k, 8k, 12k, 16k, 24k, 32k, 48k, 64k, 96k, and 128k tokens. Throughput typically falls as context grows — comparing context 0 to context 128k shows how well a model holds up at long context.
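
The sweep and the "how well does it hold up" comparison can be sketched as follows. The sketch assumes 1k means 1,024 tokens, which matches the 131,072-token upper bound; the retention helper and the sample throughput values are invented for illustration.

```python
# Context depths in units of 1k tokens, as listed above.
DEPTHS_K = [0, 1, 2, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128]
DEPTHS = [k * 1024 for k in DEPTHS_K]  # assumption: 1k = 1,024 tokens

def retention(tps_by_depth, depth):
    """Fraction of the depth-0 throughput that remains at the given depth."""
    return tps_by_depth[depth] / tps_by_depth[0]

# Invented numbers: 120 tok/s with an empty cache, 45 tok/s at 128k.
sample = {0: 120.0, 131072: 45.0}

print(len(DEPTHS), DEPTHS[-1])
print(retention(sample, 131072))
```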

Run parameters

-ngl 999 — every layer offloaded to the GPU. No CPU spillover, so the numbers reflect the GPU alone.
-fa 1 — Flash Attention enabled.
-ctk q8_0 -ctv q8_0 — KV cache stored in 8-bit quantized form.
-b 2048 -ub 512 — batch size (n_batch) 2048 and micro-batch size (n_ubatch) 512, the llama-bench defaults.
Each measurement is averaged over multiple samples; llama-bench reports both avg_ts and stddev_ts (the mean and standard deviation of tokens/sec) for each row.
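
Put together, a single run might be assembled and read back roughly as below. This is a sketch under assumptions: -b and -ub are taken to be the llama-bench flags behind n_batch and n_ubatch, -d is taken to set the context depth, and -o json is taken to emit rows with avg_ts and stddev_ts fields. The model path and the sample row are placeholders, not real results.

```python
import json

def bench_command(model_path, depth, variant_args):
    """Assemble one llama-bench invocation as an argument list (hypothetical helper)."""
    return [
        "llama-bench", "-m", model_path,
        "-ngl", "999",                   # offload every layer to the GPU
        "-fa", "1",                      # Flash Attention on
        "-ctk", "q8_0", "-ctv", "q8_0",  # 8-bit quantized KV cache
        "-b", "2048", "-ub", "512",      # n_batch and n_ubatch
        "-d", str(depth),                # context already in the KV cache
        "-o", "json",
        *variant_args,
    ]

def summarize(json_text):
    """Return (avg_ts, stddev_ts, relative spread) for each result row."""
    rows = []
    for row in json.loads(json_text):
        avg, std = row["avg_ts"], row["stddev_ts"]
        rows.append((avg, std, std / avg))
    return rows

# Placeholder output row, invented for the example.
sample_output = '[{"avg_ts": 80.0, "stddev_ts": 0.4}]'

print(" ".join(bench_command("model.gguf", 4096, ["-p", "0", "-n", "128"])))
print(summarize(sample_output))
```

The relative spread (stddev_ts divided by avg_ts) is a quick way to spot noisy rows before comparing two GPUs or quants.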

Why some curves stop early

Runs are intentionally pushed until they hit a CUDA out-of-memory error. A line that terminates before 128k context is therefore not missing data — it shows the practical maximum context for that (GPU, model, quant) combination under these run parameters. A large model on a small card will fall off early; the same model on a 5090 may make it to 128k.
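
The "run until it fails" behaviour can be sketched as a loop over increasing depths that stops at the first failure. The run_one callable is hypothetical: it stands in for whatever launches the benchmark and is assumed to return tokens/sec, or None when the run ends in an out-of-memory error.

```python
def sweep_until_failure(depths, run_one):
    """Run depths in increasing order, stopping at the first failed run."""
    results = {}
    for depth in sorted(depths):
        tps = run_one(depth)  # hypothetical runner: tokens/sec, or None on out-of-memory
        if tps is None:
            break
        results[depth] = tps
    return results

def practical_max_context(results):
    """Largest depth that completed, or None if even depth 0 failed."""
    return max(results) if results else None

# Fake runner for illustration: pretends memory runs out beyond 32k tokens.
def fake_run(depth):
    return 100.0 - depth / 1024 if depth <= 32768 else None

depths = [k * 1024 for k in (0, 1, 2, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128)]
curve = sweep_until_failure(depths, fake_run)
print(practical_max_context(curve))  # 32768
```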