This page covers vLLM's built-in benchmarking framework: the dataset abstraction layer, online serving benchmarks, offline throughput benchmarks, latency benchmarks, and the endpoint request function abstraction. For information about the OpenAI-compatible server being benchmarked, see 6.1. For metrics exposed at runtime by the server, see 3.6.
The benchmarking framework lives under vllm/benchmarks/ and is exposed through the vllm bench CLI subcommand. Three primary benchmark modes are provided:
| CLI command | Module | What it measures |
|---|---|---|
| vllm bench serve | vllm/benchmarks/serve.py | Online serving: TTFT, TPOT, ITL, E2EL, throughput |
| vllm bench throughput | vllm/benchmarks/throughput.py | Offline batch throughput: tokens/s, requests/s |
| vllm bench latency | vllm/benchmarks/latency.py | Single-batch latency over repeated iterations |
Legacy scripts at benchmarks/benchmark_serving.py, benchmarks/benchmark_throughput.py, and benchmarks/benchmark_latency.py print deprecation notices and exit with code 1, redirecting users to the CLI equivalents.
Sources: vllm/benchmarks/serve.py:1-60, vllm/benchmarks/throughput.py:1-50, vllm/benchmarks/latency.py:1-20, benchmarks/benchmark_serving.py:1-17
All benchmark inputs are represented as SampleRequest instances sampled from a BenchmarkDataset subclass.
### SampleRequest

Defined in vllm/benchmarks/datasets.py:71-83:

- prompt: str | list[str]
- prompt_len: int
- expected_output_len: int
- multi_modal_data: MultiModalDataDict | dict | list[dict] | None
- lora_request: LoRARequest | None
- request_id: str | None
prompt can be a list to support batched embedding/reranking requests. multi_modal_data carries image/video content in OpenAI API format.
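For illustration, the fields above can be mirrored in a plain dataclass (a simplified sketch; types such as MultiModalDataDict and LoRARequest are collapsed to Any here):

```python
from dataclasses import dataclass
from typing import Any, Optional, Union


@dataclass
class SampleRequest:
    """One benchmark input (simplified sketch of the fields listed above)."""
    prompt: Union[str, list[str]]       # a list supports batched embedding/reranking
    prompt_len: int
    expected_output_len: int
    multi_modal_data: Optional[Any] = None  # image/video content, OpenAI API format
    lora_request: Optional[Any] = None
    request_id: Optional[str] = None


req = SampleRequest(prompt="Hello", prompt_len=1, expected_output_len=16)
print(req.expected_output_len)  # 16
```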
### BenchmarkDataset Base Class

Defined in vllm/benchmarks/datasets.py:90-253. The abstract base class defines:

- load_data() — must be implemented by subclasses to populate self.data.
- sample(tokenizer, num_requests, ...) → list[SampleRequest] — abstract; subclasses implement dataset-specific sampling logic.
- maybe_oversample_requests(requests, num_requests, ...) — if the dataset produces fewer than num_requests samples, copies are randomly drawn to reach the target count.
- get_random_lora_request(max_loras, lora_path) — optionally attaches a LoRARequest to each sample.
- apply_multimodal_chat_transformation(prompt, mm_content) — converts text + multimodal content into a chat-format message list.

Sources: vllm/benchmarks/datasets.py:90-253, vllm/benchmarks/throughput.py:19-35
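A minimal sketch of this contract (toy subclass and a simplified oversampling loop; not the real vLLM implementation):

```python
import random
from abc import ABC, abstractmethod


class BenchmarkDataset(ABC):
    """Sketch of the base-class contract described above."""

    def __init__(self, random_seed: int = 0):
        self.data = None
        self.rng = random.Random(random_seed)

    @abstractmethod
    def load_data(self):
        """Populate self.data."""

    @abstractmethod
    def sample(self, tokenizer, num_requests):
        """Return a list of sampled requests."""

    def maybe_oversample_requests(self, requests, num_requests):
        # If sampling yielded fewer requests than asked for, draw
        # random copies until the target count is reached.
        while len(requests) < num_requests:
            requests.append(self.rng.choice(requests))
        return requests


class ToyDataset(BenchmarkDataset):
    def load_data(self):
        self.data = ["a", "b"]

    def sample(self, tokenizer, num_requests):
        self.load_data()
        return self.maybe_oversample_requests(list(self.data), num_requests)


reqs = ToyDataset().sample(tokenizer=None, num_requests=5)
print(len(reqs))  # 5
```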
| Dataset class | --dataset-name / path | Key parameters |
|---|---|---|
| RandomDataset | random | input_len, output_len, prefix_len, range_ratio |
| RandomMultiModalDataset | random-mm | bucket_config, base_items_per_request, limit_mm_per_prompt |
| RandomDatasetForReranking | random-rerank | batchsize, is_reranker |
| ShareGPTDataset | sharegpt | output_len (optional override) |
| SonnetDataset | sonnet | prefix_len, input_len, output_len |
| BurstGPTDataset | burstgpt | — |
| VisionArenaDataset | hf + specific path | enable_multimodal_chat |
| InstructCoderDataset | hf + specific path | — |
| ConversationDataset | hf + specific path | dataset_subset, dataset_split |
| MultiModalConversationDataset | hf + specific path | enable_multimodal_chat |
| AIMODataset | hf + specific path | — |
| PrefixRepetitionRandomDataset | prefix_repetition | prefix_len, suffix_len, num_prefixes, output_len |
Sources: vllm/benchmarks/throughput.py:338-476, vllm/benchmarks/datasets.py:443-1200
### RandomDataset Internals

RandomDataset uses numpy.random.default_rng seeded with random_seed for isolation from global RNG state. Token generation works as follows:

- get_sampling_params() draws uniform integer lengths for input and output from [floor(len*(1-r)), ceil(len*(1+r))].
- A block of prefix_len tokens is generated once via get_prefix().
- Input token ids are computed as allowed_tokens[(offset + index + arange(input_len)) % len(allowed_tokens)].
- gen_prompt_decode_to_target_len() decodes the token sequence to a string and re-encodes it up to 10 times, adjusting the length, to compensate for tokenizers that are not bijective.

Sources: vllm/benchmarks/datasets.py:443-688
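The length draw and modular token indexing above can be sketched as follows (function name and offset handling are illustrative simplifications, not the real vLLM internals):

```python
import math

import numpy as np


def sample_random_tokens(input_len, output_len, range_ratio, vocab_size, seed=0):
    """Sketch of RandomDataset-style sampling as described above."""
    rng = np.random.default_rng(seed)  # isolated from global RNG state

    # Uniform integer draws in [floor(len*(1-r)), ceil(len*(1+r))], inclusive.
    lo, hi = math.floor(input_len * (1 - range_ratio)), math.ceil(input_len * (1 + range_ratio))
    in_len = int(rng.integers(lo, hi + 1))
    lo, hi = math.floor(output_len * (1 - range_ratio)), math.ceil(output_len * (1 + range_ratio))
    out_len = int(rng.integers(lo, hi + 1))

    # Token ids via modular indexing into the allowed-token range.
    offset = int(rng.integers(0, vocab_size))
    token_ids = (offset + np.arange(in_len)) % vocab_size
    return token_ids, in_len, out_len


ids, in_len, out_len = sample_random_tokens(32, 16, 0.25, vocab_size=1000)
print(len(ids) == in_len)  # True
```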
Two functions convert raw media into the OpenAI chat API format:
- process_image(image) — accepts PIL.Image, path/URL string, or {"bytes": ...} dict; returns {"type": "image_url", "image_url": {"url": ...}}.
- process_video(video) — accepts path/URL string or {"bytes": ...} dict; returns {"type": "video_url", "video_url": {"url": ...}}.

RandomMultiModalDataset.generate_mm_item() calls these after generating synthetic pixel data via generate_synthetic_image() or generate_synthetic_video() (using OpenCV).
Sources: vllm/benchmarks/datasets.py:296-379, vllm/benchmarks/datasets.py:852-964
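The conversion can be sketched as follows (simplified: only URL strings and bytes dicts are handled, and the JPEG MIME type is an assumption):

```python
import base64


def process_image(image):
    """Sketch of raw media -> OpenAI chat-format conversion."""
    if isinstance(image, str):
        url = image  # already a path or URL
    elif isinstance(image, dict) and "bytes" in image:
        # Embed raw bytes as a base64 data URL.
        b64 = base64.b64encode(image["bytes"]).decode("utf-8")
        url = f"data:image/jpeg;base64,{b64}"
    else:
        raise TypeError(f"unsupported image type: {type(image)}")
    return {"type": "image_url", "image_url": {"url": url}}


item = process_image({"bytes": b"\xff\xd8\xff"})
print(item["type"])  # image_url
```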
vllm/benchmarks/lib/endpoint_request_func.py provides a protocol-based abstraction for communicating with different inference backends during online benchmarks.
| Class | Purpose |
|---|---|
| RequestFuncInput | Input to any request function: prompt, api_url, prompt_len, output_len, model, multi_modal_content, etc. |
| RequestFuncOutput | Output with timing: generated_text, success, latency, ttft, itl (list), tpot, output_tokens, error, start_time |
| RequestFunc | Protocol: (RequestFuncInput, ClientSession, tqdm?) → Awaitable[RequestFuncOutput] |
| StreamedResponseHandler | Accumulates SSE chunks into complete messages by buffering until \n\n separators |
Sources: vllm/benchmarks/lib/endpoint_request_func.py:63-106
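The SSE buffering behavior of StreamedResponseHandler can be sketched as follows (a simplified stand-in, not the real class):

```python
class StreamedResponseHandler:
    """Sketch: buffer raw SSE chunks and emit complete messages only
    once a blank-line separator (\\n\\n) has been seen."""

    def __init__(self):
        self.buffer = ""

    def feed(self, chunk: bytes) -> list[str]:
        self.buffer += chunk.decode("utf-8")
        parts = self.buffer.split("\n\n")
        self.buffer = parts.pop()  # keep the trailing incomplete fragment
        return [m for m in parts if m]


h = StreamedResponseHandler()
print(h.feed(b"data: one\n\ndata: tw"))  # ['data: one']
print(h.feed(b"o\n\n"))                  # ['data: two']
```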
### ASYNC_REQUEST_FUNCS Registry

ASYNC_REQUEST_FUNCS is a dict mapping each backend name string to an async request function.
Sources: vllm/benchmarks/lib/endpoint_request_func.py:152-810
All streaming functions record TTFT on the first chunk and append per-token inter-token latencies (itl) on each subsequent chunk. The final output.latency is last_chunk_timestamp - start.
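The per-chunk timing bookkeeping can be sketched as follows (a simplified synchronous stand-in for the async request functions):

```python
import time


def time_stream(chunks):
    """Sketch: TTFT on the first chunk, inter-token latencies on each
    subsequent chunk, total latency measured to the last chunk."""
    start = time.perf_counter()
    ttft, itl, most_recent = None, [], start
    for _ in chunks:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start          # time to first token
        else:
            itl.append(now - most_recent)  # inter-token latency
        most_recent = now
    return {"ttft": ttft, "itl": itl, "latency": most_recent - start}


out = time_stream(iter(["He", "llo", "!"]))
print(len(out["itl"]))  # 2
```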
## Online Serving Benchmark (vllm bench serve)

Module: vllm/benchmarks/serve.py
get_request() is an async generator that emits (SampleRequest, current_rate) tuples. Inter-request delay is sampled from a Gamma distribution:
- burstiness = 1.0 → exponential inter-arrivals (a Poisson process)
- burstiness < 1 → more bursty
- burstiness > 1 → more uniform

Optionally, a ramp-up strategy (linear or exponential) can be applied to increase RPS from ramp_up_start_rps to ramp_up_end_rps over the duration of the benchmark. When no ramp-up is in use, the total delay is rescaled to match num_requests / request_rate exactly.
Sources: vllm/benchmarks/serve.py:217-339
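The Gamma-based delay sampling can be sketched with NumPy (shape and scale follow the description above: shape = burstiness, with the scale chosen so the mean delay stays at 1/request_rate; a burstiness of 1 reduces to exponential delays):

```python
import numpy as np


def inter_request_delays(request_rate, burstiness, n, seed=0):
    """Sketch of Gamma-distributed inter-arrival sampling."""
    rng = np.random.default_rng(seed)
    # Mean of Gamma(shape=k, scale=theta) is k*theta, so with
    # theta = 1/(rate*burstiness) the mean delay is always 1/rate.
    theta = 1.0 / (request_rate * burstiness)
    return rng.gamma(shape=burstiness, scale=theta, size=n)


delays = inter_request_delays(request_rate=10.0, burstiness=1.0, n=100_000)
print(round(delays.mean(), 2))  # ≈ 0.1 (mean delay = 1 / request_rate)
```

Lower burstiness concentrates arrivals into bursts separated by long gaps; higher burstiness makes the spacing more regular, while the mean rate stays fixed.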
### benchmark() Function Flow

Sources: vllm/benchmarks/serve.py:603-900
### BenchmarkMetrics

Computed by calculate_metrics() in vllm/benchmarks/serve.py:391-600:
| Field | Description |
|---|---|
| completed / failed | Count of succeeded/failed requests |
| total_input / total_output | Sum of input/output token counts |
| request_throughput | completed / dur_s |
| request_goodput | Requests meeting SLO constraints / dur_s |
| output_throughput | total_output / dur_s |
| total_token_throughput | (total_input + total_output) / dur_s |
| mean/median/std/percentiles_ttft_ms | Time to first token distribution (ms) |
| mean/median/std/percentiles_tpot_ms | Time per output token (ms) |
| mean/median/std/percentiles_itl_ms | Inter-token latency distribution (ms) |
| mean/median/std/percentiles_e2el_ms | End-to-end latency (ms) |
| max_output_tokens_per_s | Peak tokens/s in any single 1-second bucket |
| max_concurrent_requests | Peak concurrent requests in any 1-second window |
| rtfx | Inverse Real-Time Factor (input_audio_duration / dur_s) for ASR |
Goodput is computed by checking each request's ttft, tpot, and/or e2el against SLO thresholds specified via --goodput.
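A sketch of the goodput check (dict-based request records and SLO thresholds are illustrative; the real code reads RequestFuncOutput fields):

```python
def compute_goodput(outputs, slos, dur_s):
    """Sketch: count requests meeting every SLO threshold (keys such as
    'ttft', 'tpot', 'e2el', in seconds), divided by the benchmark duration."""
    good = 0
    for out in outputs:
        # A request missing a metric fails that SLO.
        if all(out.get(metric, float("inf")) <= limit
               for metric, limit in slos.items()):
            good += 1
    return good / dur_s


outputs = [
    {"ttft": 0.05, "e2el": 1.2},
    {"ttft": 0.30, "e2el": 0.9},  # fails the TTFT SLO
]
print(compute_goodput(outputs, {"ttft": 0.1, "e2el": 2.0}, dur_s=2.0))  # 0.5
```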
For pooling endpoints, calculate_metrics_for_embeddings() produces EmbedBenchmarkMetrics with completed, failed, total_input, request_throughput, total_token_throughput, and E2EL statistics (no TTFT/TPOT/ITL).
Sources: vllm/benchmarks/serve.py:169-215, vllm/benchmarks/serve.py:342-388
If the server has speculative decoding enabled, fetch_spec_decode_metrics() scrapes the /metrics Prometheus endpoint and returns a SpecDecodeMetrics dataclass with num_drafts, num_draft_tokens, num_accepted_tokens, and accepted_per_pos.
Sources: vllm/benchmarks/serve.py:94-161
## Offline Throughput Benchmark (vllm bench throughput)

Module: vllm/benchmarks/throughput.py
Four execution backends are supported:
| Function | Backend | Notes |
|---|---|---|
| run_vllm() | vllm | Uses LLM.generate() or LLM.beam_search() synchronously |
| run_vllm_chat() | vllm-chat | Uses LLM.chat(), intended for multimodal models |
| run_vllm_async() | vllm-async | Uses AsyncLLM via build_async_engine_client_from_engine_args |
| run_hf() | hf | Uses AutoModelForCausalLM directly for comparison |
run_vllm() adds all requests to the engine at once and measures time.perf_counter() around llm.generate(). It constructs TextPrompt or TokensPrompt objects depending on whether the sample includes prompt_token_ids. LoRA requests from SampleRequest.lora_request are passed through when engine_args.enable_lora is set.
run_vllm_async() submits requests concurrently via merge_async_iterators() and consumes all outputs before stopping the timer.
Result output includes requests_per_second and tokens_per_second, and can be written to JSON via --output-json. save_to_pytorch_benchmark_format() additionally converts results to the PyTorch benchmark format.
Sources: vllm/benchmarks/throughput.py:46-248
### get_requests()

get_requests() in vllm/benchmarks/throughput.py:338-476 maps args.dataset_name to the correct BenchmarkDataset subclass, applies dataset-specific keyword arguments, then calls .sample(). For hf datasets, the path is checked against the SUPPORTED_DATASET_PATHS class attribute on each dataset class to select the correct subclass automatically.
filter_requests_for_dp() filters the request list to ensure divisibility by data_parallel_size for data-parallel execution modes.
Sources: vllm/benchmarks/throughput.py:338-484
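The data-parallel filtering can be sketched as follows (the exact trimming policy is an assumption; the real function may distribute requests differently):

```python
def filter_requests_for_dp(requests, data_parallel_size):
    """Sketch: trim the request list so its length divides evenly
    across data-parallel ranks."""
    keep = len(requests) - len(requests) % data_parallel_size
    return requests[:keep]


print(len(filter_requests_for_dp(list(range(10)), 4)))  # 8
```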
## Latency Benchmark (vllm bench latency)

Module: vllm/benchmarks/latency.py
Measures end-to-end latency of a single fixed-size batch over repeated iterations.
Key CLI arguments:
| Argument | Default | Description |
|---|---|---|
| --input-len | 32 | Prompt token count |
| --output-len | 128 | Output token count |
| --batch-size | 8 | Number of prompts per batch |
| --num-iters-warmup | 10 | Warmup iterations (excluded from stats) |
| --num-iters | 30 | Benchmark iterations |
| --profile | False | Enables torch/CUDA profiler for one iteration |
Prefix caching is disabled by default (parser.set_defaults(enable_prefix_caching=False)) to avoid cache-skewed results (vllm/benchmarks/latency.py:77).
run_to_completion() wraps llm.generate() or llm.beam_search() with time.perf_counter(). After warmup, num_iters latency samples are collected and percentiles at [10, 25, 50, 75, 90, 99] are printed. Results can be written to a JSON file.
Sources: vllm/benchmarks/latency.py:80-172
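The warmup-then-measure loop can be sketched as follows (a generic sketch; the real code wraps llm.generate() and uses the CLI defaults above):

```python
import time

import numpy as np


def measure_latency(run_once, num_iters_warmup=3, num_iters=10):
    """Sketch: discard warmup iterations, then collect per-iteration
    latencies and report the percentiles used by vllm bench latency."""
    for _ in range(num_iters_warmup):
        run_once()  # warmup, excluded from stats
    latencies = []
    for _ in range(num_iters):
        start = time.perf_counter()
        run_once()
        latencies.append(time.perf_counter() - start)
    pcts = [10, 25, 50, 75, 90, 99]
    return dict(zip(pcts, np.percentile(latencies, pcts)))


report = measure_latency(lambda: sum(range(10_000)))
print(sorted(report) == [10, 25, 50, 75, 90, 99])  # True
```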
vllm/benchmarks/lib/ready_checker.py provides wait_for_endpoint(), which polls a RequestFunc until it returns a successful response or a timeout expires. It uses a tqdm progress bar showing elapsed/remaining time and retries every retry_interval seconds (default: 5s). The default timeout is 600 seconds.
This is called by benchmark() in serve.py before the warmup phase to ensure the server is ready.
Sources: vllm/benchmarks/lib/ready_checker.py:18-79
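The polling loop can be sketched as follows (progress bar omitted; the probe signature is simplified to a plain callable rather than a RequestFunc):

```python
import time


def wait_for_endpoint(probe, timeout_s=600.0, retry_interval_s=5.0):
    """Sketch: call `probe` until it reports success or the timeout
    expires, sleeping retry_interval_s between attempts."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(retry_interval_s)
    raise TimeoutError(f"endpoint not ready after {timeout_s}s")


attempts = iter([False, False, True])
print(wait_for_endpoint(lambda: next(attempts), retry_interval_s=0.01))  # True
```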
Both serve.py and throughput.py support --output-json to write full result dictionaries. The JSON includes aggregate statistics, per-request latencies, input/output lengths, and errors.
convert_to_pytorch_benchmark_format() from vllm/benchmarks/lib/utils.py converts results into PyTorch benchmark format for integration with CI tooling. The .pytorch.json file is written alongside the main JSON file.
vllm/benchmarks/plot.py provides generate_timeline_plot(), which generates an interactive HTML Plotly timeline from per-request result dicts. Each request is rendered as a row with a TTFT segment followed by ITL segments color-categorized by threshold (e.g., < 25ms, 25–50ms, ≥ 50ms). The output is an HTML file suitable for browser viewing.
construct_timeline_data() converts the raw start_time, ttft, itl[], and latency fields into Gantt-chart-style records with ISO-formatted timestamps.
Sources: vllm/benchmarks/plot.py:25-240
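The conversion from per-request timing fields into timeline segments can be sketched as follows (the record schema is illustrative; field names follow the text above):

```python
from datetime import datetime, timezone


def construct_timeline_data(requests):
    """Sketch: expand each request's start_time/ttft/itl into
    Gantt-style segments with ISO-formatted timestamps."""
    rows = []
    for i, req in enumerate(requests):
        t = req["start_time"]
        # One TTFT segment, then one segment per inter-token latency.
        segments = [("ttft", req["ttft"])] + [("itl", d) for d in req["itl"]]
        for kind, duration in segments:
            rows.append({
                "request": i,
                "kind": kind,
                "start": datetime.fromtimestamp(t, tz=timezone.utc).isoformat(),
                "end": datetime.fromtimestamp(t + duration, tz=timezone.utc).isoformat(),
            })
            t += duration
    return rows


rows = construct_timeline_data([{"start_time": 0.0, "ttft": 0.1, "itl": [0.02, 0.03]}])
print(len(rows))  # 3
```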
If termplotlib and gnuplot are installed, calculate_metrics() in serve.py renders ASCII plots of tokens-per-second and concurrent-requests-per-second over time directly in the terminal (vllm/benchmarks/serve.py:543-559).
vllm/benchmarks/sweep/ contains helpers for running parameter sweeps:
- server.py — ServerProcess context manager that launches a vllm serve subprocess, waits for readiness, resets caches between benchmark runs via the /reset_prefix_cache, /reset_mm_cache, and /reset_encoder_cache endpoints, and terminates the process group on exit.
- utils.py — sanitize_filename() for safe output file naming.

Sources: vllm/benchmarks/sweep/server.py:14-143
Sources: vllm/benchmarks/serve.py:391-600, vllm/benchmarks/throughput.py:46-248, vllm/benchmarks/datasets.py:186-253