Every .gguf on this site is benchmarked with llama-bench from llama.cpp using identical parameters across every model and every GPU, so cross-GPU and cross-quant comparisons are apples-to-apples. Each model is run through five test variants at fourteen context depths from 0 to 131,072 tokens.

The Ranking tab compares GPUs against a fixed 12-model panel chosen to cover a range of sizes, families, and architectures (dense + MoE) up to ~12 GB so the set fits on every modern card. Each GPU's composite score is the geometric mean of its tokens/sec across those 12 models. A GPU only appears in the ranking if it has llama-bench data for every model in the panel. Token Generation drives the sort; Prefill is shown on a second axis.
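
As a rough illustration of the composite score, the sketch below computes a geometric mean over a panel and returns nothing for a GPU that lacks data for any panel model. The model names and throughput figures are hypothetical placeholders rather than site data, and the function name `composite_score` is invented for this example.

```python
import math

def composite_score(tokens_per_sec, panel):
    """Geometric mean of tokens/sec over the panel, or None if a model is missing."""
    if any(model not in tokens_per_sec for model in panel):
        return None  # incomplete panel: the GPU would not be ranked
    logs = [math.log(tokens_per_sec[model]) for model in panel]
    return math.exp(sum(logs) / len(logs))

# Hypothetical three-model panel; the real panel has 12 models.
panel = ["model-a", "model-b", "model-c"]
complete_gpu = {"model-a": 100.0, "model-b": 50.0, "model-c": 25.0}
partial_gpu = {"model-a": 100.0, "model-b": 50.0}

print(composite_score(complete_gpu, panel))
print(composite_score(partial_gpu, panel))
```

A geometric mean weights relative differences equally across models, so one very fast small model does not dominate the score the way it would in an arithmetic mean.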
| Display Name | Prefill | Generation | What it measures |
|---|---|---|---|
| Token Generation | — | 128 tok | Pure steady-state output speed. Tokens per second the model emits once it has started generating. This is what you feel when waiting for a reply. |
| Prefill Speed | 1,024 tok | — | Pure prompt-processing throughput. How fast the model can ingest your input before it produces the first token of output. Drives time-to-first-token for long inputs. |
| Agentic Small | 256 tok | 256 tok | Short prompt in, short reply out. Models a trivial tool reply, a simple chat turn, or a small classifier-style call. |
| Agentic Medium | 1,024 tok | 256 tok | A typical agent step: system prompt + tool output + scratchpad fit in roughly 1k tokens, the model writes a short reasoning step and a tool call. |
| Agentic Large | 4,096 tok | 256 tok | A heavy agent step with a long accumulated history of tool outputs. Still emitting a relatively short reply. |
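
One way to picture the five variants is as (prefill, generation) token pairs that translate into llama-bench test flags. The mapping below is an assumption made for illustration: the site does not state how each variant is invoked. It relies on llama-bench's `-p` (prompt tokens), `-n` (generated tokens), and `-pg` (combined prompt plus generation) options, and `variant_flags` is a hypothetical helper name.

```python
# (prefill tokens, generation tokens) per variant, taken from the table above.
VARIANTS = {
    "Token Generation": (0, 128),
    "Prefill Speed": (1024, 0),
    "Agentic Small": (256, 256),
    "Agentic Medium": (1024, 256),
    "Agentic Large": (4096, 256),
}

def variant_flags(prefill, generation):
    """Return llama-bench test flags for one variant (assumed mapping)."""
    if prefill and generation:
        # Zero out the default -p/-n tests so only the combined test runs.
        return ["-p", "0", "-n", "0", "-pg", f"{prefill},{generation}"]
    return ["-p", str(prefill), "-n", str(generation)]

for name, (prefill, generation) in VARIANTS.items():
    print(name, variant_flags(prefill, generation))
```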
The Context Size filter (and the X axis of the Context Scaling tab) is the amount of prior conversation already loaded in the KV cache before the variant runs. The sweep covers 0, 1k, 2k, 4k, 6k, 8k, 12k, 16k, 24k, 32k, 48k, 64k, 96k, and 128k tokens. Throughput typically falls as context grows — comparing context 0 to context 128k shows how well a model holds up at long context.
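
The sweep can be written out in tokens, taking 1k to mean 1,024 tokens (consistent with 128k being 131,072). The sketch below also shows one possible way to summarize how well a model holds up: throughput at the deepest measured depth divided by throughput at depth 0. The `retention` helper and the sample figures are hypothetical, not part of the site's tooling.

```python
K = 1024
# The fourteen context depths of the sweep, in tokens.
DEPTHS = [k * K for k in (0, 1, 2, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128)]

def retention(ts_by_depth, base=0):
    """Return (deepest depth, throughput there / throughput at the base depth)."""
    deepest = max(ts_by_depth)
    return deepest, ts_by_depth[deepest] / ts_by_depth[base]

# Hypothetical tokens/sec for a run that stopped at 32k context.
sample = {0: 80.0, 8 * K: 60.0, 32 * K: 40.0}
print(len(DEPTHS), DEPTHS[-1])
print(retention(sample))
```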
* -ngl 999: every layer offloaded to the GPU. No CPU spillover, so the numbers reflect the GPU alone.
* -fa 1: Flash Attention enabled.
* -ctk q8_0 -ctv q8_0: KV cache stored in 8-bit quantized form.
* -b 2048 -ub 512 (n_batch 2048, n_ubatch 512): standard batching defaults.

llama-bench reports both avg_ts and a standard deviation per row.

Runs are intentionally pushed until they hit a CUDA out-of-memory error. A line that terminates before 128k context is therefore not missing data: it shows the practical maximum context for that (GPU, model, quant) combination under these run parameters. A large model on a small card will fall off early; the same model on a 5090 may make it to 128k.
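
The fixed run parameters and the early-termination rule can be sketched together as follows. This is a minimal illustration and not the site's actual harness: `bench_argv` and `practical_max_context` are hypothetical helpers, `-b` and `-ub` are llama-bench's flags for n_batch and n_ubatch, `-d` sets the context depth, and the `n_depth` and `avg_ts` keys are assumed to match the fields in llama-bench's JSON output.

```python
# Flags held constant for every model and GPU.
FIXED_FLAGS = [
    "-ngl", "999",                   # offload every layer to the GPU
    "-fa", "1",                      # Flash Attention on
    "-ctk", "q8_0", "-ctv", "q8_0",  # 8-bit quantized KV cache
    "-b", "2048", "-ub", "512",      # n_batch / n_ubatch
]

def bench_argv(model_path, depth, test_flags):
    """Assemble one llama-bench command line as an argv list (not executed here)."""
    return ["llama-bench", "-m", model_path, *FIXED_FLAGS,
            "-d", str(depth), *test_flags, "-o", "json"]

def practical_max_context(rows):
    """Deepest context depth with a completed measurement, or None if there is none."""
    completed = [row["n_depth"] for row in rows if row.get("avg_ts", 0) > 0]
    return max(completed) if completed else None

# Hypothetical rows for a run that hit out-of-memory after 64k context.
rows = [{"n_depth": 0, "avg_ts": 90.0}, {"n_depth": 65536, "avg_ts": 30.0}]
print(bench_argv("model.gguf", 4096, ["-p", "0", "-n", "128"]))
print(practical_max_context(rows))
```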