This page covers vLLM's built-in benchmarking framework: the dataset abstraction layer, online serving benchmarks, offline throughput benchmarks, latency benchmarks, and the endpoint request function abstraction. For information about the OpenAI-compatible server being benchmarked, see 6.1. For metrics exposed at runtime by the server, see 3.6.
The benchmarking framework lives under vllm/benchmarks/ and is exposed through the vllm bench CLI subcommand. Three primary benchmark modes are provided:
| CLI command | Module | What it measures |
|---|---|---|
| vllm bench serve | vllm/benchmarks/serve.py | Online serving: TTFT, TPOT, ITL, E2EL, throughput |
| vllm bench throughput | vllm/benchmarks/throughput.py | Offline batch throughput: tokens/s, requests/s |
| vllm bench latency | vllm/benchmarks/latency.py | Single-batch latency over repeated iterations |
Legacy scripts at benchmarks/benchmark_serving.py, benchmarks/benchmark_throughput.py, and benchmarks/benchmark_latency.py print deprecation notices and exit with code 1, redirecting users to the CLI equivalents.
Sources: vllm/benchmarks/serve.py:1-60, vllm/benchmarks/throughput.py:1-50, vllm/benchmarks/latency.py:1-20, benchmarks/benchmark_serving.py:1-17
All benchmark inputs are represented as SampleRequest instances sampled from a BenchmarkDataset subclass.
### SampleRequest

Defined in vllm/benchmarks/datasets.py:71-83:

- prompt: str | list[str]
- prompt_len: int
- expected_output_len: int
- multi_modal_data: MultiModalDataDict | dict | list[dict] | None
- lora_request: LoRARequest | None
- request_id: str | None
prompt can be a list to support batched embedding/reranking requests. multi_modal_data carries image/video content in OpenAI API format.
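For illustration, the fields above can be mirrored in a plain dataclass (a simplified sketch; types such as MultiModalDataDict and LoRARequest are collapsed to Any here):

```python
from dataclasses import dataclass
from typing import Any, Optional, Union


@dataclass
class SampleRequest:
    """One benchmark input (simplified sketch of the fields listed above)."""
    prompt: Union[str, list[str]]       # a list supports batched embedding/reranking
    prompt_len: int
    expected_output_len: int
    multi_modal_data: Optional[Any] = None  # image/video content, OpenAI API format
    lora_request: Optional[Any] = None
    request_id: Optional[str] = None


req = SampleRequest(prompt="Hello", prompt_len=1, expected_output_len=16)
print(req.expected_output_len)  # 16
```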
### BenchmarkDataset Base Class

Defined in vllm/benchmarks/datasets.py:90-253. The abstract base class defines:

- load_data() — must be implemented by subclasses to populate self.data.
- sample(tokenizer, num_requests, ...) → list[SampleRequest] — abstract; subclasses implement dataset-specific sampling logic.
- maybe_oversample_requests(requests, num_requests, ...) — if the dataset produces fewer than num_requests samples, copies are randomly drawn to reach the target count.
- get_random_lora_request(max_loras, lora_path) — optionally attaches a LoRARequest to each sample.
- apply_multimodal_chat_transformation(prompt, mm_content) — converts text + multimodal content into a chat-format message list.

Sources: vllm/benchmarks/datasets.py:90-253, vllm/benchmarks/throughput.py:19-35
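A minimal sketch of this contract (toy subclass and a simplified oversampling loop; not the real vLLM implementation):

```python
import random
from abc import ABC, abstractmethod


class BenchmarkDataset(ABC):
    """Sketch of the base-class contract described above."""

    def __init__(self, random_seed: int = 0):
        self.data = None
        self.rng = random.Random(random_seed)

    @abstractmethod
    def load_data(self):
        """Populate self.data."""

    @abstractmethod
    def sample(self, tokenizer, num_requests):
        """Return a list of sampled requests."""

    def maybe_oversample_requests(self, requests, num_requests):
        # If sampling yielded fewer requests than asked for, draw
        # random copies until the target count is reached.
        while len(requests) < num_requests:
            requests.append(self.rng.choice(requests))
        return requests


class ToyDataset(BenchmarkDataset):
    def load_data(self):
        self.data = ["a", "b"]

    def sample(self, tokenizer, num_requests):
        self.load_data()
        return self.maybe_oversample_requests(list(self.data), num_requests)


reqs = ToyDataset().sample(tokenizer=None, num_requests=5)
print(len(reqs))  # 5
```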
| Dataset class | --dataset-name / path | Key parameters |
|---|---|---|
| RandomDataset | random | input_len, output_len, prefix_len, range_ratio |
| RandomMultiModalDataset | random-mm | bucket_config, base_items_per_request, limit_mm_per_prompt |
| RandomDatasetForReranking | random-rerank | batchsize, is_reranker |
| ShareGPTDataset | sharegpt | output_len (optional override) |
| SonnetDataset | sonnet | prefix_len, input_len, output_len |
| BurstGPTDataset | burstgpt | — |
| VisionArenaDataset | hf + specific path | enable_multimodal_chat |
| InstructCoderDataset | hf + specific path | — |
| ConversationDataset | hf + specific path | dataset_subset, dataset_split |
| MultiModalConversationDataset | hf + specific path | enable_multimodal_chat |
| AIMODataset | hf + specific path | — |
| PrefixRepetitionRandomDataset | prefix_repetition | prefix_len, suffix_len, num_prefixes, output_len |
Sources: vllm/benchmarks/throughput.py:338-476, vllm/benchmarks/datasets.py:443-1200
### RandomDataset Internals

RandomDataset uses numpy.random.default_rng seeded with random_seed for isolation from global RNG state. Token generation works as follows:

- get_sampling_params() draws uniform integer lengths for input and output from [floor(len*(1-r)), ceil(len*(1+r))].
- A block of prefix_len tokens is generated once via get_prefix().
- Input token ids are computed as allowed_tokens[(offset + index + arange(input_len)) % len(allowed_tokens)].
- gen_prompt_decode_to_target_len() decodes the token sequence to a string and re-encodes it up to 10 times, adjusting the length, to compensate for tokenizers that are not bijective.

Sources: vllm/benchmarks/datasets.py:443-688
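The length draw and modular token indexing above can be sketched as follows (function name and offset handling are illustrative simplifications, not the real vLLM internals):

```python
import math

import numpy as np


def sample_random_tokens(input_len, output_len, range_ratio, vocab_size, seed=0):
    """Sketch of RandomDataset-style sampling as described above."""
    rng = np.random.default_rng(seed)  # isolated from global RNG state

    # Uniform integer draws in [floor(len*(1-r)), ceil(len*(1+r))], inclusive.
    lo, hi = math.floor(input_len * (1 - range_ratio)), math.ceil(input_len * (1 + range_ratio))
    in_len = int(rng.integers(lo, hi + 1))
    lo, hi = math.floor(output_len * (1 - range_ratio)), math.ceil(output_len * (1 + range_ratio))
    out_len = int(rng.integers(lo, hi + 1))

    # Token ids via modular indexing into the allowed-token range.
    offset = int(rng.integers(0, vocab_size))
    token_ids = (offset + np.arange(in_len)) % vocab_size
    return token_ids, in_len, out_len


ids, in_len, out_len = sample_random_tokens(32, 16, 0.25, vocab_size=1000)
print(len(ids) == in_len)  # True
```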
Two functions convert raw media into the OpenAI chat API format:
- process_image(image) — accepts PIL.Image, path/URL string, or {"bytes": ...} dict; returns {"type": "image_url", "image_url": {"url": ...}}.
- process_video(video) — accepts path/URL string or {"bytes": ...} dict; returns {"type": "video_url", "video_url": {"url": ...}}.

RandomMultiModalDataset.generate_mm_item() calls these after generating synthetic pixel data via generate_synthetic_image() or generate_synthetic_video() (using OpenCV).
Sources: vllm/benchmarks/datasets.py:296-379, vllm/benchmarks/datasets.py:852-964
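The conversion can be sketched as follows (simplified: only URL strings and bytes dicts are handled, and the JPEG MIME type is an assumption):

```python
import base64


def process_image(image):
    """Sketch of raw media -> OpenAI chat-format conversion."""
    if isinstance(image, str):
        url = image  # already a path or URL
    elif isinstance(image, dict) and "bytes" in image:
        # Embed raw bytes as a base64 data URL.
        b64 = base64.b64encode(image["bytes"]).decode("utf-8")
        url = f"data:image/jpeg;base64,{b64}"
    else:
        raise TypeError(f"unsupported image type: {type(image)}")
    return {"type": "image_url", "image_url": {"url": url}}


item = process_image({"bytes": b"\xff\xd8\xff"})
print(item["type"])  # image_url
```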
vllm/benchmarks/lib/endpoint_request_func.py provides a protocol-based abstraction for communicating with different inference backends during online benchmarks.
| Class | Purpose |
|---|---|
| RequestFuncInput | Input to any request function: prompt, api_url, prompt_len, output_len, model, multi_modal_content, etc. |
| RequestFuncOutput | Output with timing: generated_text, success, latency, ttft, itl (list), tpot, output_tokens, error, start_time |
| RequestFunc | Protocol: (RequestFuncInput, ClientSession, tqdm?) → Awaitable[RequestFuncOutput] |
| StreamedResponseHandler | Accumulates SSE chunks into complete messages by buffering until \n\n separators |
Sources: vllm/benchmarks/lib/endpoint_request_func.py:63-106
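The SSE buffering behavior of StreamedResponseHandler can be sketched as follows (a simplified stand-in, not the real class):

```python
class StreamedResponseHandler:
    """Sketch: buffer raw SSE chunks and emit complete messages only
    once a blank-line separator (\\n\\n) has been seen."""

    def __init__(self):
        self.buffer = ""

    def feed(self, chunk: bytes) -> list[str]:
        self.buffer += chunk.decode("utf-8")
        parts = self.buffer.split("\n\n")
        self.buffer = parts.pop()  # keep the trailing incomplete fragment
        return [m for m in parts if m]


h = StreamedResponseHandler()
print(h.feed(b"data: one\n\ndata: tw"))  # ['data: one']
print(h.feed(b"o\n\n"))                  # ['data: two']
```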
### ASYNC_REQUEST_FUNCS Registry

ASYNC_REQUEST_FUNCS is a dict mapping each backend name string to an async request function.
Sources: vllm/benchmarks/lib/endpoint_request_func.py:152-810
All streaming functions record TTFT on the first chunk and append per-token inter-token latencies (itl) on each subsequent chunk. The final output.latency is last_chunk_timestamp - start.
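The per-chunk timing bookkeeping can be sketched as follows (a simplified synchronous stand-in for the async request functions):

```python
import time


def time_stream(chunks):
    """Sketch: TTFT on the first chunk, inter-token latencies on each
    subsequent chunk, total latency measured to the last chunk."""
    start = time.perf_counter()
    ttft, itl, most_recent = None, [], start
    for _ in chunks:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start          # time to first token
        else:
            itl.append(now - most_recent)  # inter-token latency
        most_recent = now
    return {"ttft": ttft, "itl": itl, "latency": most_recent - start}


out = time_stream(iter(["He", "llo", "!"]))
print(len(out["itl"]))  # 2
```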
## Online Serving Benchmark (vllm bench serve)

Module: vllm/benchmarks/serve.py
get_request() is an async generator that emits (SampleRequest, current_rate) tuples. Inter-request delay is sampled from a Gamma distribution:
- burstiness = 1.0 → exponential inter-arrivals (a Poisson process)
- burstiness < 1 → more bursty
- burstiness > 1 → more uniform

Optionally, a ramp-up strategy (linear or exponential) can be applied to increase RPS from ramp_up_start_rps to ramp_up_end_rps over the duration of the benchmark. When no ramp-up is in use, the total delay is rescaled to match num_requests / request_rate exactly.
Sources: vllm/benchmarks/serve.py:217-339
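The Gamma-based delay sampling can be sketched with NumPy (shape and scale follow the description above: shape = burstiness, with the scale chosen so the mean delay stays at 1/request_rate; a burstiness of 1 reduces to exponential delays):

```python
import numpy as np


def inter_request_delays(request_rate, burstiness, n, seed=0):
    """Sketch of Gamma-distributed inter-arrival sampling."""
    rng = np.random.default_rng(seed)
    # Mean of Gamma(shape=k, scale=theta) is k*theta, so with
    # theta = 1/(rate*burstiness) the mean delay is always 1/rate.
    theta = 1.0 / (request_rate * burstiness)
    return rng.gamma(shape=burstiness, scale=theta, size=n)


delays = inter_request_delays(request_rate=10.0, burstiness=1.0, n=100_000)
print(round(delays.mean(), 2))  # ≈ 0.1 (mean delay = 1 / request_rate)
```

Lower burstiness concentrates arrivals into bursts separated by long gaps; higher burstiness makes the spacing more regular, while the mean rate stays fixed.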
### benchmark() Function Flow

Sources: vllm/benchmarks/serve.py:603-900
### BenchmarkMetrics

Computed by calculate_metrics() in vllm/benchmarks/serve.py:391-600:
| Field | Description |
|---|---|
| completed / failed | Count of succeeded/failed requests |
| total_input / total_output | Sum of input/output token counts |
| request_throughput | completed / dur_s |
| request_goodput | Requests meeting SLO constraints / dur_s |
| output_throughput | total_output / dur_s |
| total_token_throughput | (total_input + total_output) / dur_s |
| mean/median/std/percentiles_ttft_ms | Time to first token distribution (ms) |
| mean/median/std/percentiles_tpot_ms | Time per output token (ms) |
| mean/median/std/percentiles_itl_ms | Inter-token latency distribution (ms) |
| mean/median/std/percentiles_e2el_ms | End-to-end latency (ms) |
| max_output_tokens_per_s | Peak tokens/s in any single 1-second bucket |
| max_concurrent_requests | Peak concurrent requests in any 1-second window |
| rtfx | Inverse Real-Time Factor (input_audio_duration / dur_s) for ASR |
Goodput is computed by checking each request's ttft, tpot, and/or e2el against SLO thresholds specified via --goodput.
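A sketch of the goodput check (dict-based request records and SLO thresholds are illustrative; the real code reads RequestFuncOutput fields):

```python
def compute_goodput(outputs, slos, dur_s):
    """Sketch: count requests meeting every SLO threshold (keys such as
    'ttft', 'tpot', 'e2el', in seconds), divided by the benchmark duration."""
    good = 0
    for out in outputs:
        # A request missing a metric fails that SLO.
        if all(out.get(metric, float("inf")) <= limit
               for metric, limit in slos.items()):
            good += 1
    return good / dur_s


outputs = [
    {"ttft": 0.05, "e2el": 1.2},
    {"ttft": 0.30, "e2el": 0.9},  # fails the TTFT SLO
]
print(compute_goodput(outputs, {"ttft": 0.1, "e2el": 2.0}, dur_s=2.0))  # 0.5
```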
For pooling endpoints, calculate_metrics_for_embeddings() produces EmbedBenchmarkMetrics with completed, failed, total_input, request_throughput, total_token_throughput, and E2EL statistics (no TTFT/TPOT/ITL).
Sources: vllm/benchmarks/serve.py:169-215, vllm/benchmarks/serve.py:342-388
If the server has speculative decoding enabled, fetch_spec_decode_metrics() scrapes the /metrics Prometheus endpoint and returns a SpecDecodeMetrics dataclass with num_drafts, num_draft_tokens, num_accepted_tokens, and accepted_per_pos.
Sources: vllm/benchmarks/serve.py:94-161
## Offline Throughput Benchmark (vllm bench throughput)

Module: vllm/benchmarks/throughput.py
Four execution backends are supported:
| Function | Backend | Notes |
|---|---|---|
| run_vllm() | vllm | Uses LLM.generate() or LLM.beam_search() synchronously |
| run_vllm_chat() | vllm-chat | Uses LLM.chat(), intended for multimodal models |
| run_vllm_async() | vllm-async | Uses AsyncLLM via build_async_engine_client_from_engine_args |
| run_hf() | hf | Uses AutoModelForCausalLM directly for comparison |
run_vllm() adds all requests to the engine at once and measures time.perf_counter() around llm.generate(). It constructs TextPrompt or TokensPrompt objects depending on whether the sample includes prompt_token_ids. LoRA requests from SampleRequest.lora_request are passed through when engine_args.enable_lora is set.
run_vllm_async() submits requests concurrently via merge_async_iterators() and consumes all outputs before stopping the timer.
Result output includes requests_per_second and tokens_per_second, and can be written to JSON via --output-json. save_to_pytorch_benchmark_format() additionally converts results to the PyTorch benchmark format.
Sources: vllm/benchmarks/throughput.py:46-248
### get_requests()

get_requests() in vllm/benchmarks/throughput.py:338-476 maps args.dataset_name to the correct BenchmarkDataset subclass, applies dataset-specific keyword arguments, then calls .sample(). For hf datasets, the path is checked against the SUPPORTED_DATASET_PATHS class attribute on each dataset class to select the correct subclass automatically.
filter_requests_for_dp() filters the request list to ensure divisibility by data_parallel_size for data-parallel execution modes.
Sources: vllm/benchmarks/throughput.py:338-484
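The data-parallel filtering can be sketched as follows (the exact trimming policy is an assumption; the real function may distribute requests differently):

```python
def filter_requests_for_dp(requests, data_parallel_size):
    """Sketch: trim the request list so its length divides evenly
    across data-parallel ranks."""
    keep = len(requests) - len(requests) % data_parallel_size
    return requests[:keep]


print(len(filter_requests_for_dp(list(range(10)), 4)))  # 8
```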
## Latency Benchmark (vllm bench latency)

Module: vllm/benchmarks/latency.py
Measures end-to-end latency of a single fixed-size batch over repeated iterations.
Key CLI arguments:
| Argument | Default | Description |
|---|---|---|
| --input-len | 32 | Prompt token count |
| --output-len | 128 | Output token count |
| --batch-size | 8 | Number of prompts per batch |
| --num-iters-warmup | 10 | Warmup iterations (excluded from stats) |
| --num-iters | 30 | Benchmark iterations |
| --profile | False | Enables torch/CUDA profiler for one iteration |
Prefix caching is disabled by default (parser.set_defaults(enable_prefix_caching=False)) to avoid cache-skewed results (vllm/benchmarks/latency.py:77).
run_to_completion() wraps llm.generate() or llm.beam_search() with time.perf_counter(). After warmup, num_iters latency samples are collected and percentiles at [10, 25, 50, 75, 90, 99] are printed. Results can be written to a JSON file.
Sources: vllm/benchmarks/latency.py:80-172
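The warmup-then-measure loop can be sketched as follows (a generic sketch; the real code wraps llm.generate() and uses the CLI defaults above):

```python
import time

import numpy as np


def measure_latency(run_once, num_iters_warmup=3, num_iters=10):
    """Sketch: discard warmup iterations, then collect per-iteration
    latencies and report the percentiles used by vllm bench latency."""
    for _ in range(num_iters_warmup):
        run_once()  # warmup, excluded from stats
    latencies = []
    for _ in range(num_iters):
        start = time.perf_counter()
        run_once()
        latencies.append(time.perf_counter() - start)
    pcts = [10, 25, 50, 75, 90, 99]
    return dict(zip(pcts, np.percentile(latencies, pcts)))


report = measure_latency(lambda: sum(range(10_000)))
print(sorted(report) == [10, 25, 50, 75, 90, 99])  # True
```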
vllm/benchmarks/lib/ready_checker.py provides wait_for_endpoint(), which polls a RequestFunc until it returns a successful response or a timeout expires. It uses a tqdm progress bar showing elapsed/remaining time and retries every retry_interval seconds (default: 5s). The default timeout is 600 seconds.
This is called by benchmark() in serve.py before the warmup phase to ensure the server is ready.
Sources: vllm/benchmarks/lib/ready_checker.py:18-79
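The polling loop can be sketched as follows (progress bar omitted; the probe signature is simplified to a plain callable rather than a RequestFunc):

```python
import time


def wait_for_endpoint(probe, timeout_s=600.0, retry_interval_s=5.0):
    """Sketch: call `probe` until it reports success or the timeout
    expires, sleeping retry_interval_s between attempts."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(retry_interval_s)
    raise TimeoutError(f"endpoint not ready after {timeout_s}s")


attempts = iter([False, False, True])
print(wait_for_endpoint(lambda: next(attempts), retry_interval_s=0.01))  # True
```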
Both serve.py and throughput.py support --output-json to write full result dictionaries. The JSON includes aggregate statistics, per-request latencies, input/output lengths, and errors.
convert_to_pytorch_benchmark_format() from vllm/benchmarks/lib/utils.py converts results into PyTorch benchmark format for integration with CI tooling. The .pytorch.json file is written alongside the main JSON file.
vllm/benchmarks/plot.py provides generate_timeline_plot(), which generates an interactive HTML Plotly timeline from per-request result dicts. Each request is rendered as a row with a TTFT segment followed by ITL segments color-categorized by threshold (e.g., < 25ms, 25–50ms, ≥ 50ms). The output is an HTML file suitable for browser viewing.
construct_timeline_data() converts the raw start_time, ttft, itl[], and latency fields into Gantt-chart-style records with ISO-formatted timestamps.
Sources: vllm/benchmarks/plot.py:25-240
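The conversion from per-request timing fields into timeline segments can be sketched as follows (the record schema is illustrative; field names follow the text above):

```python
from datetime import datetime, timezone


def construct_timeline_data(requests):
    """Sketch: expand each request's start_time/ttft/itl into
    Gantt-style segments with ISO-formatted timestamps."""
    rows = []
    for i, req in enumerate(requests):
        t = req["start_time"]
        # One TTFT segment, then one segment per inter-token latency.
        segments = [("ttft", req["ttft"])] + [("itl", d) for d in req["itl"]]
        for kind, duration in segments:
            rows.append({
                "request": i,
                "kind": kind,
                "start": datetime.fromtimestamp(t, tz=timezone.utc).isoformat(),
                "end": datetime.fromtimestamp(t + duration, tz=timezone.utc).isoformat(),
            })
            t += duration
    return rows


rows = construct_timeline_data([{"start_time": 0.0, "ttft": 0.1, "itl": [0.02, 0.03]}])
print(len(rows))  # 3
```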
If termplotlib and gnuplot are installed, calculate_metrics() in serve.py renders ASCII plots of tokens-per-second and concurrent-requests-per-second over time directly in the terminal (vllm/benchmarks/serve.py:543-559).
vllm/benchmarks/sweep/ contains helpers for running parameter sweeps:
- server.py — ServerProcess context manager that launches a vllm serve subprocess, waits for readiness, resets caches between benchmark runs via the /reset_prefix_cache, /reset_mm_cache, and /reset_encoder_cache endpoints, and terminates the process group on exit.
- utils.py — sanitize_filename() for safe output file naming.

Sources: vllm/benchmarks/sweep/server.py:14-143
Sources: vllm/benchmarks/serve.py:391-600, vllm/benchmarks/throughput.py:46-248, vllm/benchmarks/datasets.py:186-253