This page documents the scheduling algorithm and resource allocation mechanisms in vLLM's v1 engine architecture. The Scheduler class is responsible for deciding which requests to process at each step, allocating GPU memory (KV cache blocks) to requests, and managing resource constraints to maximize throughput while preventing out-of-memory conditions.
For information about how the scheduler fits into the overall engine architecture, see Engine Core and Client APIs. For details on request state transitions and lifecycle, see Request Lifecycle and State Management. For the KV cache block management implementation details, see KV Cache Management.
Sources: vllm/v1/core/sched/scheduler.py1-100
The Scheduler class operates as a central coordinator that makes scheduling decisions at each engine step. It maintains request queues, tracks resource usage, and interfaces with the KVCacheManager to allocate memory blocks for request execution.
Sources: vllm/v1/core/sched/scheduler.py63-266 vllm/v1/core/kv_cache_manager.py94-142
The scheduler maintains the following data structures for tracking requests:

| Structure | Type | Purpose |
|---|---|---|
| waiting | RequestQueue | Holds requests waiting to be scheduled (WAITING, WAITING_FOR_FSM, WAITING_FOR_REMOTE_KVS, WAITING_FOR_STREAMING_REQ status) |
| running | list[Request] | Holds requests currently being executed (RUNNING status) |
| finished_req_ids | set[str] | Tracks requests that finished in the current step (cleared after each step) |
The RequestQueue implementation supports different scheduling policies, such as first-come-first-served (FCFS) and priority scheduling.
Sources: vllm/v1/core/sched/scheduler.py148-169 vllm/v1/core/sched/request_queue.py
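The two policies can be sketched as follows. This is an illustrative simplification, not vLLM's actual RequestQueue classes; method names are invented for the sketch:

```python
import heapq
from collections import deque

class FCFSRequestQueue:
    """First-come-first-served policy: pop requests in arrival order
    (sketch, not the real vLLM class)."""
    def __init__(self):
        self._queue = deque()
    def add_request(self, request):
        self._queue.append(request)
    def pop_request(self):
        return self._queue.popleft()

class PriorityRequestQueue:
    """Priority policy: pop the smallest (priority, arrival_time) tuple
    first, so lower priority values are served earlier (sketch)."""
    def __init__(self):
        self._heap = []
    def add_request(self, priority, arrival_time, request):
        heapq.heappush(self._heap, (priority, arrival_time, request))
    def pop_request(self):
        return heapq.heappop(self._heap)[2]
```

A min-heap keyed on (priority, arrival_time) gives priority ordering with FCFS tie-breaking for free.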
The schedule() method executes a two-phase algorithm every engine step: it first schedules tokens for requests already in the running queue, then admits requests from the waiting queue while budget remains.
Sources: vllm/v1/core/sched/scheduler.py318-520
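The two phases can be sketched as follows. This is a deliberately simplified illustration with invented names and signatures, not the logic in scheduler.py:

```python
def schedule(running, waiting, token_budget, max_num_running_reqs,
             tokens_needed):
    """Two-phase scheduling sketch (illustrative, not vLLM's code).

    Phase 1: schedule tokens for RUNNING requests first.
    Phase 2: admit WAITING requests with the leftover budget.
    """
    scheduled = {}
    # Phase 1: running requests consume the budget first.
    for req in running:
        take = min(tokens_needed[req], token_budget)
        if take == 0:
            break
        scheduled[req] = take
        token_budget -= take
    # Phase 2: waiting requests, subject to both the token budget
    # and the concurrent-request cap.
    num_running = len(running)
    while waiting and token_budget > 0 and num_running < max_num_running_reqs:
        req = waiting[0]
        take = min(tokens_needed[req], token_budget)
        if take == 0:
            break
        waiting.pop(0)
        scheduled[req] = take
        token_budget -= take
        num_running += 1
    return scheduled
```

Running requests are served first so that in-flight decodes are never starved by new arrivals.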
Sources: vllm/v1/core/sched/scheduler.py531-831
The scheduler enforces multiple resource constraints simultaneously:
The primary constraint is the token budget (max_num_scheduled_tokens), which limits the total number of tokens that can be scheduled in a single step:
Sources: vllm/v1/core/sched/scheduler.py337-340
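The budget cap can be sketched in one function; this is an illustrative simplification, not vLLM's code:

```python
def cap_to_token_budget(num_tokens, num_computed_tokens, token_budget):
    """Sketch: the new tokens a request may schedule this step are
    capped by the remaining step-wide budget (max_num_scheduled_tokens).
    This cap is what makes chunked prefill possible."""
    num_new_tokens = num_tokens - num_computed_tokens
    return min(num_new_tokens, token_budget)
```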
The scheduler limits the number of concurrent requests: once the number of running requests reaches max_num_running_reqs, no further waiting requests are admitted in that step.
Sources: vllm/v1/core/sched/scheduler.py538-539
KV cache availability is checked through allocate_slots(), which returns None when there are not enough free blocks to satisfy the request:
Sources: vllm/v1/core/sched/scheduler.py722-740
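The allocation check can be modeled with a toy block pool. ToyKVCacheManager here is a stand-in invented for illustration, not the real KVCacheManager:

```python
def cdiv(a, b):
    """Ceiling division."""
    return -(-a // b)

class ToyKVCacheManager:
    """Toy stand-in for KVCacheManager.allocate_slots (illustrative only)."""
    def __init__(self, num_free_blocks, block_size=16):
        self.num_free_blocks = num_free_blocks
        self.block_size = block_size

    def allocate_slots(self, num_new_tokens):
        # Round the token count up to whole blocks.
        num_blocks = cdiv(num_new_tokens, self.block_size)
        if num_blocks > self.num_free_blocks:
            return None  # not enough memory: caller must stop or preempt
        self.num_free_blocks -= num_blocks
        return list(range(num_blocks))
```

The None return value is the signal that drives preemption, described below.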
For multimodal models, an additional encoder budget tracks the number of encoder input tokens that can be processed:
Sources: vllm/v1/core/sched/scheduler.py342-344 vllm/v1/core/sched/scheduler.py389-405
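A greedy sketch of encoder budgeting, assuming encoder inputs are considered in order (names are invented for illustration):

```python
def schedule_encoder_inputs(encoder_inputs, encoder_budget):
    """Sketch: schedule encoder inputs (e.g. images) in order until the
    per-step encoder token budget runs out; the rest wait for a later
    step. Not vLLM's actual implementation."""
    scheduled = []
    for input_id, num_tokens in encoder_inputs:
        if num_tokens > encoder_budget:
            break  # this input must wait for a later step
        scheduled.append(input_id)
        encoder_budget -= num_tokens
    return scheduled, encoder_budget
```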
| Constraint | Type | Purpose |
|---|---|---|
| max_num_scheduled_tokens | Token count | Limits batch size to prevent OOM |
| max_num_running_reqs | Request count | Limits concurrent requests for memory |
| KV cache blocks | Memory | Actual GPU memory for KV cache |
| max_num_encoder_input_tokens | Token count | Limits multimodal encoder compute |
| max_model_len | Token count | Maximum sequence length per request |
Sources: vllm/v1/core/sched/scheduler.py101-108
KV cache allocation is performed through the KVCacheManager.allocate_slots() method, which handles multiple scenarios:
Sources: vllm/v1/core/kv_cache_manager.py206-376
The allocation logic handles different token types with specific ordering:
┌─────────────────────────────────────────────────────────────────────────┐
│ Block Allocation Layout for a Request │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┬─────────────┬──────────────┬──────────┬──────────────┐ │
│ │ Computed │ New │ External │ New │ Lookahead │ │
│ │ (local │ Computed │ Computed │ Tokens │ (spec │ │
│ │ prefix │ (prefix │ (KV transfer)│ │ decode) │ │
│ │ cache) │ cache hit) │ │ │ │ │
│ └──────────┴─────────────┴──────────────┴──────────┴──────────────┘ │
│ │
│ num_computed_tokens ───────┘ │
│ │
│ num_tokens ────────────────────────────────────────┘ │
│ │
│ num_tokens_need_slot ──────────────────────────────────────────────┘ │
│ │
│ ← Blocks to be cached (enable_caching) ──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Sources: vllm/v1/core/kv_cache_manager.py238-285
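The slot arithmetic behind the layout above can be sketched as follows; variable names mirror the diagram, but the function itself is an illustrative simplification:

```python
def blocks_to_allocate(num_computed_tokens, num_new_tokens,
                       num_lookahead_tokens, num_allocated_blocks,
                       block_size=16):
    """Sketch of allocate_slots' block math (not the real code).

    Slots are needed for every token up through the new tokens plus
    speculative lookahead; blocks the request already holds are
    subtracted."""
    num_tokens_need_slot = (num_computed_tokens + num_new_tokens
                            + num_lookahead_tokens)
    num_blocks_needed = -(-num_tokens_need_slot // block_size)  # ceil
    return max(0, num_blocks_needed - num_allocated_blocks)
```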
When a request is first scheduled, the scheduler queries for prefix cache hits:
The get_computed_blocks() method searches the prefix cache for matching block hashes:
Sources: vllm/v1/core/sched/scheduler.py601-606 vllm/v1/core/kv_cache_manager.py164-204
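Prefix-cache matching works on chained hashes of full blocks, so a block's hash covers its entire prefix. A minimal sketch (hash scheme and helper names invented for illustration, not vLLM's actual hashing):

```python
import hashlib

def hash_block(parent_hash, token_ids):
    """Chained block hash: includes the parent hash so each block's
    hash identifies its whole prefix (illustrative scheme)."""
    data = repr((parent_hash, tuple(token_ids))).encode()
    return hashlib.sha256(data).hexdigest()

def get_computed_blocks(token_ids, cached_blocks, block_size=16):
    """Walk the prompt's full blocks, stopping at the first cache miss
    (sketch of prefix-cache lookup, not vLLM's implementation)."""
    hits = []
    parent = None
    # Only full blocks can hit the prefix cache.
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        h = hash_block(parent, token_ids[start:start + block_size])
        if h not in cached_blocks:
            break
        hits.append(cached_blocks[h])
        parent = h
    num_computed_tokens = len(hits) * block_size
    return hits, num_computed_tokens
```

A hit lets the scheduler skip recomputing those tokens entirely: the matched blocks are reused and num_computed_tokens starts past them.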
When KV cache allocation fails due to insufficient blocks, the scheduler preempts requests to free memory:
Sources: vllm/v1/core/sched/scheduler.py443-476
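The preemption loop can be sketched with a toy block pool (ToyKV and the function below are invented for illustration, not vLLM's actual classes):

```python
class ToyKV:
    """Toy block pool standing in for KVCacheManager (illustrative)."""
    def __init__(self, free_blocks):
        self.free = free_blocks
        self.owned = {}

    def allocate_slots(self, req, n):
        if n > self.free:
            return None
        self.free -= n
        self.owned[req] = self.owned.get(req, 0) + n
        return n

    def free_request(self, req):
        self.free += self.owned.pop(req, 0)

def allocate_with_preemption(req, n, kv, running, waiting):
    """Sketch: if allocation fails, evict from the back of the running
    list, free the victim's blocks, and retry."""
    while kv.allocate_slots(req, n) is None:
        if not running:
            return False  # nothing left to preempt
        victim = running.pop()      # preempt the most recent request
        kv.free_request(victim)     # release its KV cache blocks
        waiting.insert(0, victim)   # it will be rescheduled from scratch
    return True
```

A preempted request loses its KV cache and must recompute (or re-fetch via prefix cache) when rescheduled, so preemption trades latency for avoiding OOM.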
The _preempt_request() method handles the preemption mechanics:
Sources: vllm/v1/core/sched/scheduler.py833-876
When using priority scheduling, the scheduler selects the lowest-priority running request for preemption:
Requests with higher priority values (i.e., lower scheduling priority) are preempted first. Among requests with the same priority value, later arrivals are preempted first.
Sources: vllm/v1/core/sched/scheduler.py445-450
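Victim selection reduces to a max over (priority, arrival_time); the dict-based request shape below is invented for the sketch:

```python
def select_preemption_victim(running):
    """Sketch: pick the request with the largest priority value (lowest
    scheduling priority); ties broken by latest arrival."""
    return max(running, key=lambda r: (r["priority"], r["arrival_time"]))
```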
The scheduler supports chunked prefill, where long prompts are split across multiple steps:
This prevents a single long request from monopolizing the token budget.
Sources: vllm/v1/core/sched/scheduler.py379-380 vllm/v1/core/sched/scheduler.py656-658
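The effect of chunking can be seen by walking a long prompt through successive steps; a minimal sketch, assuming a fixed per-step budget:

```python
def chunked_prefill_steps(prompt_len, token_budget):
    """Sketch: a long prompt is split across engine steps, each step
    scheduling at most token_budget of the remaining prompt tokens."""
    steps = []
    computed = 0
    while computed < prompt_len:
        chunk = min(prompt_len - computed, token_budget)
        steps.append(chunk)
        computed += chunk
    return steps
```

Between chunks, the leftover budget is available to other requests, so short decodes keep making progress alongside a long prefill.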
When speculative decoding is enabled, the scheduler allocates extra slots for lookahead tokens:
Draft tokens are scheduled separately and tracked in request.spec_token_ids.
Sources: vllm/v1/core/sched/scheduler.py491-502 vllm/v1/core/sched/scheduler.py205-214
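The accounting can be sketched as follows, assuming a dict-shaped request (invented for illustration): a decode step verifies one target token plus the current draft tokens, while slot allocation reserves extra lookahead room for the next draft round:

```python
def spec_decode_step(request, num_lookahead_tokens, block_size=16):
    """Sketch of speculative-decoding accounting (not vLLM's code).

    Returns (tokens scheduled this step, total blocks the request
    needs including lookahead slots)."""
    num_scheduled = 1 + len(request["spec_token_ids"])
    num_tokens_need_slot = (request["num_computed_tokens"]
                            + num_scheduled + num_lookahead_tokens)
    num_blocks = -(-num_tokens_need_slot // block_size)  # ceil
    return num_scheduled, num_blocks
```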
For multimodal models, the scheduler coordinates both token scheduling and encoder input scheduling:
The method ensures encoder inputs are scheduled when their corresponding text tokens are scheduled.
Sources: vllm/v1/core/sched/scheduler.py389-405 vllm/v1/core/sched/scheduler.py1559-1750
For hybrid models with Mamba layers, the scheduler enforces block-aligned splitting during prefill:
This ensures Mamba state caching works correctly by aligning chunk boundaries to block boundaries.
Sources: vllm/v1/core/sched/scheduler.py407-410 vllm/v1/core/sched/scheduler.py268-316
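Block alignment amounts to truncating each prefill chunk down to a block boundary; a minimal sketch, with the final-chunk edge case simplified:

```python
def align_chunk_to_blocks(num_computed_tokens, num_new_tokens, block_size):
    """Sketch: truncate a prefill chunk so it ends on a block boundary,
    as hybrid models with Mamba layers require for state caching."""
    end = num_computed_tokens + num_new_tokens
    aligned_end = (end // block_size) * block_size
    if aligned_end > num_computed_tokens:
        return aligned_end - num_computed_tokens
    return num_new_tokens  # chunk smaller than one block: leave as-is
```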
When async scheduling is enabled (typically with pipeline parallelism), the scheduler allows multiple in-flight steps per request:
This prevents over-scheduling while maintaining pipeline throughput.
Sources: vllm/v1/core/sched/scheduler.py358-372
When KV transfer is enabled (for disaggregated prefill/decode), requests may be in WAITING_FOR_REMOTE_KVS state:
The scheduler coordinates with the KV connector to determine when remote KV data is available.
Sources: vllm/v1/core/sched/scheduler.py544-561 vllm/v1/core/sched/scheduler.py1479-1557
When LoRA adapters are used, the scheduler enforces a maximum number of concurrent LoRA adapters:
Sources: vllm/v1/core/sched/scheduler.py521-530 vllm/v1/core/sched/scheduler.py583-594
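The admission rule can be sketched as a set-membership check; the function below is an illustrative simplification:

```python
def can_admit_lora_request(request_lora_id, scheduled_lora_ids, max_loras):
    """Sketch: admit a waiting request only if its LoRA adapter is
    already active in the batch or there is room for one more adapter."""
    if request_lora_id is None or request_lora_id in scheduled_lora_ids:
        return True
    return len(scheduled_lora_ids) < max_loras
```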