This page documents the scheduling algorithm and resource allocation mechanisms in vLLM's v1 engine architecture. The Scheduler class is responsible for deciding which requests to process at each step, allocating GPU memory (KV cache blocks) to requests, and managing resource constraints to maximize throughput while preventing out-of-memory conditions.
For information about how the scheduler fits into the overall engine architecture, see Engine Core and Client APIs. For details on request state transitions and lifecycle, see Request Lifecycle and State Management. For the KV cache block management implementation details, see KV Cache Management.
Sources: vllm/v1/core/sched/scheduler.py1-100
The Scheduler class operates as a central coordinator that makes scheduling decisions at each engine step. It maintains request queues, tracks resource usage, and interfaces with the KVCacheManager to allocate memory blocks for request execution.
Sources: vllm/v1/core/sched/scheduler.py63-266 vllm/v1/core/kv_cache_manager.py94-142
The scheduler maintains the following data structures for tracking requests:

| Structure | Type | Purpose |
|---|---|---|
| waiting | RequestQueue | Holds requests waiting to be scheduled (WAITING, WAITING_FOR_FSM, WAITING_FOR_REMOTE_KVS, WAITING_FOR_STREAMING_REQ status) |
| running | list[Request] | Holds requests currently being executed (RUNNING status) |
| finished_req_ids | set[str] | Tracks requests that finished in the current step (cleared after each step) |
The RequestQueue implementation supports different scheduling policies, such as first-come-first-served (FCFS) and priority scheduling.
Sources: vllm/v1/core/sched/scheduler.py148-169 vllm/v1/core/sched/request_queue.py
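The two policies can be sketched as follows. This is an illustrative simplification, not vLLM's actual RequestQueue classes; method names are invented for the sketch:

```python
import heapq
from collections import deque

class FCFSRequestQueue:
    """First-come-first-served policy: pop requests in arrival order
    (sketch, not the real vLLM class)."""
    def __init__(self):
        self._queue = deque()
    def add_request(self, request):
        self._queue.append(request)
    def pop_request(self):
        return self._queue.popleft()

class PriorityRequestQueue:
    """Priority policy: pop the smallest (priority, arrival_time) tuple
    first, so lower priority values are served earlier (sketch)."""
    def __init__(self):
        self._heap = []
    def add_request(self, priority, arrival_time, request):
        heapq.heappush(self._heap, (priority, arrival_time, request))
    def pop_request(self):
        return heapq.heappop(self._heap)[2]
```

A min-heap keyed on (priority, arrival_time) gives priority ordering with FCFS tie-breaking for free.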
The schedule() method executes a two-phase algorithm every engine step: it first schedules tokens for requests already in the running queue, then admits requests from the waiting queue while budget remains.
Sources: vllm/v1/core/sched/scheduler.py318-520
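The two phases can be sketched as follows. This is a deliberately simplified illustration with invented names and signatures, not the logic in scheduler.py:

```python
def schedule(running, waiting, token_budget, max_num_running_reqs,
             tokens_needed):
    """Two-phase scheduling sketch (illustrative, not vLLM's code).

    Phase 1: schedule tokens for RUNNING requests first.
    Phase 2: admit WAITING requests with the leftover budget.
    """
    scheduled = {}
    # Phase 1: running requests consume the budget first.
    for req in running:
        take = min(tokens_needed[req], token_budget)
        if take == 0:
            break
        scheduled[req] = take
        token_budget -= take
    # Phase 2: waiting requests, subject to both the token budget
    # and the concurrent-request cap.
    num_running = len(running)
    while waiting and token_budget > 0 and num_running < max_num_running_reqs:
        req = waiting[0]
        take = min(tokens_needed[req], token_budget)
        if take == 0:
            break
        waiting.pop(0)
        scheduled[req] = take
        token_budget -= take
        num_running += 1
    return scheduled
```

Running requests are served first so that in-flight decodes are never starved by new arrivals.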
Sources: vllm/v1/core/sched/scheduler.py531-831
The scheduler enforces multiple resource constraints simultaneously:
The primary constraint is the token budget (max_num_scheduled_tokens), which limits the total number of tokens that can be scheduled in a single step:
Sources: vllm/v1/core/sched/scheduler.py337-340
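The budget cap can be sketched in one function; this is an illustrative simplification, not vLLM's code:

```python
def cap_to_token_budget(num_tokens, num_computed_tokens, token_budget):
    """Sketch: the new tokens a request may schedule this step are
    capped by the remaining step-wide budget (max_num_scheduled_tokens).
    This cap is what makes chunked prefill possible."""
    num_new_tokens = num_tokens - num_computed_tokens
    return min(num_new_tokens, token_budget)
```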
The scheduler limits the number of concurrent requests: once the number of running requests reaches max_num_running_reqs, no further waiting requests are admitted in that step.
Sources: vllm/v1/core/sched/scheduler.py538-539
KV cache availability is checked through allocate_slots(), which returns None when there are not enough free blocks to satisfy the request:
Sources: vllm/v1/core/sched/scheduler.py722-740
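The allocation check can be modeled with a toy block pool. ToyKVCacheManager here is a stand-in invented for illustration, not the real KVCacheManager:

```python
def cdiv(a, b):
    """Ceiling division."""
    return -(-a // b)

class ToyKVCacheManager:
    """Toy stand-in for KVCacheManager.allocate_slots (illustrative only)."""
    def __init__(self, num_free_blocks, block_size=16):
        self.num_free_blocks = num_free_blocks
        self.block_size = block_size

    def allocate_slots(self, num_new_tokens):
        # Round the token count up to whole blocks.
        num_blocks = cdiv(num_new_tokens, self.block_size)
        if num_blocks > self.num_free_blocks:
            return None  # not enough memory: caller must stop or preempt
        self.num_free_blocks -= num_blocks
        return list(range(num_blocks))
```

The None return value is the signal that drives preemption, described below.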
For multimodal models, an additional encoder budget tracks the number of encoder input tokens that can be processed:
Sources: vllm/v1/core/sched/scheduler.py342-344 vllm/v1/core/sched/scheduler.py389-405
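A greedy sketch of encoder budgeting, assuming encoder inputs are considered in order (names are invented for illustration):

```python
def schedule_encoder_inputs(encoder_inputs, encoder_budget):
    """Sketch: schedule encoder inputs (e.g. images) in order until the
    per-step encoder token budget runs out; the rest wait for a later
    step. Not vLLM's actual implementation."""
    scheduled = []
    for input_id, num_tokens in encoder_inputs:
        if num_tokens > encoder_budget:
            break  # this input must wait for a later step
        scheduled.append(input_id)
        encoder_budget -= num_tokens
    return scheduled, encoder_budget
```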
| Constraint | Type | Purpose |
|---|---|---|
| max_num_scheduled_tokens | Token count | Limits batch size to prevent OOM |
| max_num_running_reqs | Request count | Limits concurrent requests for memory |
| KV cache blocks | Memory | Actual GPU memory for KV cache |
| max_num_encoder_input_tokens | Token count | Limits multimodal encoder compute |
| max_model_len | Token count | Maximum sequence length per request |
Sources: vllm/v1/core/sched/scheduler.py101-108
KV cache allocation is performed through the KVCacheManager.allocate_slots() method, which handles multiple scenarios:
Sources: vllm/v1/core/kv_cache_manager.py206-376
The allocation logic handles different token types with specific ordering:
┌─────────────────────────────────────────────────────────────────────────┐
│ Block Allocation Layout for a Request │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┬─────────────┬──────────────┬──────────┬──────────────┐ │
│ │ Computed │ New │ External │ New │ Lookahead │ │
│ │ (local │ Computed │ Computed │ Tokens │ (spec │ │
│ │ prefix │ (prefix │ (KV transfer)│ │ decode) │ │
│ │ cache) │ cache hit) │ │ │ │ │
│ └──────────┴─────────────┴──────────────┴──────────┴──────────────┘ │
│ │
│ num_computed_tokens ───────┘ │
│ │
│ num_tokens ────────────────────────────────────────┘ │
│ │
│ num_tokens_need_slot ──────────────────────────────────────────────┘ │
│ │
│ ← Blocks to be cached (enable_caching) ──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Sources: vllm/v1/core/kv_cache_manager.py238-285
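The slot arithmetic behind the layout above can be sketched as follows; variable names mirror the diagram, but the function itself is an illustrative simplification:

```python
def blocks_to_allocate(num_computed_tokens, num_new_tokens,
                       num_lookahead_tokens, num_allocated_blocks,
                       block_size=16):
    """Sketch of allocate_slots' block math (not the real code).

    Slots are needed for every token up through the new tokens plus
    speculative lookahead; blocks the request already holds are
    subtracted."""
    num_tokens_need_slot = (num_computed_tokens + num_new_tokens
                            + num_lookahead_tokens)
    num_blocks_needed = -(-num_tokens_need_slot // block_size)  # ceil
    return max(0, num_blocks_needed - num_allocated_blocks)
```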
When a request is first scheduled, the scheduler queries for prefix cache hits:
The get_computed_blocks() method searches the prefix cache for matching block hashes:
Sources: vllm/v1/core/sched/scheduler.py601-606 vllm/v1/core/kv_cache_manager.py164-204
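Prefix-cache matching works on chained hashes of full blocks, so a block's hash covers its entire prefix. A minimal sketch (hash scheme and helper names invented for illustration, not vLLM's actual hashing):

```python
import hashlib

def hash_block(parent_hash, token_ids):
    """Chained block hash: includes the parent hash so each block's
    hash identifies its whole prefix (illustrative scheme)."""
    data = repr((parent_hash, tuple(token_ids))).encode()
    return hashlib.sha256(data).hexdigest()

def get_computed_blocks(token_ids, cached_blocks, block_size=16):
    """Walk the prompt's full blocks, stopping at the first cache miss
    (sketch of prefix-cache lookup, not vLLM's implementation)."""
    hits = []
    parent = None
    # Only full blocks can hit the prefix cache.
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        h = hash_block(parent, token_ids[start:start + block_size])
        if h not in cached_blocks:
            break
        hits.append(cached_blocks[h])
        parent = h
    num_computed_tokens = len(hits) * block_size
    return hits, num_computed_tokens
```

A hit lets the scheduler skip recomputing those tokens entirely: the matched blocks are reused and num_computed_tokens starts past them.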
When KV cache allocation fails due to insufficient blocks, the scheduler preempts requests to free memory:
Sources: vllm/v1/core/sched/scheduler.py443-476
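The preemption loop can be sketched with a toy block pool (ToyKV and the function below are invented for illustration, not vLLM's actual classes):

```python
class ToyKV:
    """Toy block pool standing in for KVCacheManager (illustrative)."""
    def __init__(self, free_blocks):
        self.free = free_blocks
        self.owned = {}

    def allocate_slots(self, req, n):
        if n > self.free:
            return None
        self.free -= n
        self.owned[req] = self.owned.get(req, 0) + n
        return n

    def free_request(self, req):
        self.free += self.owned.pop(req, 0)

def allocate_with_preemption(req, n, kv, running, waiting):
    """Sketch: if allocation fails, evict from the back of the running
    list, free the victim's blocks, and retry."""
    while kv.allocate_slots(req, n) is None:
        if not running:
            return False  # nothing left to preempt
        victim = running.pop()      # preempt the most recent request
        kv.free_request(victim)     # release its KV cache blocks
        waiting.insert(0, victim)   # it will be rescheduled from scratch
    return True
```

A preempted request loses its KV cache and must recompute (or re-fetch via prefix cache) when rescheduled, so preemption trades latency for avoiding OOM.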
The _preempt_request() method handles the preemption mechanics:
Sources: vllm/v1/core/sched/scheduler.py833-876
When using priority scheduling, the scheduler selects the lowest-priority running request for preemption:
Requests with higher priority values (i.e., lower scheduling priority) are preempted first. Among requests with the same priority value, later arrivals are preempted first.
Sources: vllm/v1/core/sched/scheduler.py445-450
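Victim selection reduces to a max over (priority, arrival_time); the dict-based request shape below is invented for the sketch:

```python
def select_preemption_victim(running):
    """Sketch: pick the request with the largest priority value (lowest
    scheduling priority); ties broken by latest arrival."""
    return max(running, key=lambda r: (r["priority"], r["arrival_time"]))
```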
The scheduler supports chunked prefill, where long prompts are split across multiple steps:
This prevents a single long request from monopolizing the token budget.
Sources: vllm/v1/core/sched/scheduler.py379-380 vllm/v1/core/sched/scheduler.py656-658
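The effect of chunking can be seen by walking a long prompt through successive steps; a minimal sketch, assuming a fixed per-step budget:

```python
def chunked_prefill_steps(prompt_len, token_budget):
    """Sketch: a long prompt is split across engine steps, each step
    scheduling at most token_budget of the remaining prompt tokens."""
    steps = []
    computed = 0
    while computed < prompt_len:
        chunk = min(prompt_len - computed, token_budget)
        steps.append(chunk)
        computed += chunk
    return steps
```

Between chunks, the leftover budget is available to other requests, so short decodes keep making progress alongside a long prefill.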
When speculative decoding is enabled, the scheduler allocates extra slots for lookahead tokens:
Draft tokens are scheduled separately and tracked in request.spec_token_ids.
Sources: vllm/v1/core/sched/scheduler.py491-502 vllm/v1/core/sched/scheduler.py205-214
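The accounting can be sketched as follows, assuming a dict-shaped request (invented for illustration): a decode step verifies one target token plus the current draft tokens, while slot allocation reserves extra lookahead room for the next draft round:

```python
def spec_decode_step(request, num_lookahead_tokens, block_size=16):
    """Sketch of speculative-decoding accounting (not vLLM's code).

    Returns (tokens scheduled this step, total blocks the request
    needs including lookahead slots)."""
    num_scheduled = 1 + len(request["spec_token_ids"])
    num_tokens_need_slot = (request["num_computed_tokens"]
                            + num_scheduled + num_lookahead_tokens)
    num_blocks = -(-num_tokens_need_slot // block_size)  # ceil
    return num_scheduled, num_blocks
```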
For multimodal models, the scheduler coordinates both token scheduling and encoder input scheduling:
The method ensures encoder inputs are scheduled when their corresponding text tokens are scheduled.
Sources: vllm/v1/core/sched/scheduler.py389-405 vllm/v1/core/sched/scheduler.py1559-1750
For hybrid models with Mamba layers, the scheduler enforces block-aligned splitting during prefill:
This ensures Mamba state caching works correctly by aligning chunk boundaries to block boundaries.
Sources: vllm/v1/core/sched/scheduler.py407-410 vllm/v1/core/sched/scheduler.py268-316
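Block alignment amounts to truncating each prefill chunk down to a block boundary; a minimal sketch, with the final-chunk edge case simplified:

```python
def align_chunk_to_blocks(num_computed_tokens, num_new_tokens, block_size):
    """Sketch: truncate a prefill chunk so it ends on a block boundary,
    as hybrid models with Mamba layers require for state caching."""
    end = num_computed_tokens + num_new_tokens
    aligned_end = (end // block_size) * block_size
    if aligned_end > num_computed_tokens:
        return aligned_end - num_computed_tokens
    return num_new_tokens  # chunk smaller than one block: leave as-is
```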
When async scheduling is enabled (typically with pipeline parallelism), the scheduler allows multiple in-flight steps per request:
This prevents over-scheduling while maintaining pipeline throughput.
Sources: vllm/v1/core/sched/scheduler.py358-372
When KV transfer is enabled (for disaggregated prefill/decode), requests may be in WAITING_FOR_REMOTE_KVS state:
The scheduler coordinates with the KV connector to determine when remote KV data is available.
Sources: vllm/v1/core/sched/scheduler.py544-561 vllm/v1/core/sched/scheduler.py1479-1557
When LoRA adapters are used, the scheduler enforces a maximum number of concurrent LoRA adapters:
Sources: vllm/v1/core/sched/scheduler.py521-530 vllm/v1/core/sched/scheduler.py583-594
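The admission rule can be sketched as a set-membership check; the function below is an illustrative simplification:

```python
def can_admit_lora_request(request_lora_id, scheduled_lora_ids, max_loras):
    """Sketch: admit a waiting request only if its LoRA adapter is
    already active in the batch or there is room for one more adapter."""
    if request_lora_id is None or request_lora_id in scheduled_lora_ids:
        return True
    return len(scheduled_lora_ids) < max_loras
```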