This page describes Langflow's voice mode feature: its WebSocket API endpoints, the two streaming modes it supports, how it integrates with the OpenAI Realtime API and ElevenLabs, and how to configure voice-enabled flows. For information about the general chat interface and text-based message handling, see Chat Interface. For information about the broader API endpoint surface, see API Endpoints.
Voice mode enables real-time audio interaction with Langflow flows using a microphone and speakers. It is implemented as a pair of WebSocket endpoints that act as a bridge between a browser client (or any WebSocket consumer) and the OpenAI Realtime API. The flow runs on the Langflow backend and is invoked either as a tool (voice-to-voice) or directly after speech transcription (speech-to-text).
Voice mode is not available in Langflow Desktop. It requires the Langflow OSS Python package.
Sources: docs/docs/Develop/concepts-voice-mode.mdx 1-96
| Requirement | Details |
|---|---|
| Langflow OSS | Must be installed as a Python package; Langflow Desktop does not support voice mode |
| Flow structure | Must contain Chat Input, Language Model, and Chat Output components |
| OpenAI API key | Required for all voice mode sessions; used to authenticate with the OpenAI Realtime API |
| ElevenLabs API key | Optional; enables additional voice options for LLM responses |
| Microphone and speakers | Physical or virtual audio devices accessible by the browser |
| Agent flows | If an Agent component is present, tools must have accurate names and descriptions; voice mode overrides any text in the Agent Instructions field |
Sources: docs/docs/Develop/concepts-voice-mode.mdx 15-30
Diagram: Voice Mode WebSocket Data Flow
Sources: docs/docs/Develop/concepts-voice-mode.mdx 64-95
Both are WebSocket endpoints that accept an OpenAI API key for authentication, and both support an optional /{session_id} path segment.
Diagram: Endpoint vs. Strategy Mapping
Sources: docs/docs/Develop/concepts-voice-mode.mdx 73-86
/ws/flow_as_tool/{flow_id} — Voice-to-Voice Streaming

This endpoint connects the client to the OpenAI Realtime voice model. The flow is registered as a callable tool: the OpenAI model listens to incoming audio and decides autonomously when to invoke the Langflow flow.
/ws/flow_tts/{flow_id} — Speech-to-Text Transcription

This endpoint uses OpenAI Realtime voice transcription to convert incoming audio to text. Each completed transcript is passed directly to the Langflow flow as input, and the flow's response is returned to the client (optionally via ElevenLabs TTS).
Sources: docs/docs/Develop/concepts-voice-mode.mdx 73-86
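The two endpoints and their streaming strategies can be captured in a small lookup, shown here as an illustrative sketch. The two paths come from the documentation above; the dictionary and helper function are not part of Langflow itself.

```python
# Endpoint-to-strategy mapping for Langflow voice mode (illustrative sketch).
VOICE_ENDPOINTS = {
    # Voice-to-voice: the OpenAI Realtime model calls the flow as a tool.
    "flow_as_tool": "/ws/flow_as_tool/{flow_id}",
    # Speech-to-text: each completed transcript is passed to the flow as input.
    "flow_tts": "/ws/flow_tts/{flow_id}",
}


def endpoint_path(mode: str, flow_id: str) -> str:
    """Return the WebSocket path for the chosen streaming mode."""
    return VOICE_ENDPOINTS[mode].format(flow_id=flow_id)


print(endpoint_path("flow_as_tool", "my-flow-id"))  # → /ws/flow_as_tool/my-flow-id
```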
Both endpoints accept an optional /{session_id} path parameter appended after the flow_id:
/ws/flow_as_tool/{flow_id}/{session_id}
/ws/flow_tts/{flow_id}/{session_id}
| Scenario | Behavior |
|---|---|
| session_id provided | Langflow uses the given value as the conversation session ID |
| session_id omitted | Langflow falls back to using the flow_id as the session ID |
| Session ends (Playground closed, connection dropped) | Verbal chat history for that session is discarded and not persisted for future sessions |
Voice mode maintains context only within the current WebSocket connection lifetime. Chat history from voice sessions is not stored in the standard message persistence layer.
Sources: docs/docs/Develop/concepts-voice-mode.mdx 88-94
Diagram: Voice Mode Integration Overview
Sources: docs/docs/Develop/concepts-voice-mode.mdx 64-95
An OpenAI API key must be supplied when establishing a voice mode WebSocket session. In the Playground, this key is entered via the Voice mode dialog and saved as a Langflow global variable.
When building applications against the WebSocket endpoints directly, the OpenAI API key must be provided as part of the connection handshake (following the same pattern as the OpenAI Realtime API WebSocket connection).
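The OpenAI Realtime WebSocket API authenticates with a Bearer token and an "OpenAI-Beta: realtime=v1" header. A minimal sketch of building those headers is shown below; whether Langflow expects the key in these exact headers or elsewhere in the handshake is an assumption — verify against the endpoint implementation before relying on it.

```python
def realtime_handshake_headers(openai_api_key: str) -> dict[str, str]:
    """Headers following the OpenAI Realtime WebSocket connection pattern.

    Assumption: Langflow's voice endpoints accept the key the same way;
    this is a sketch, not Langflow's documented contract.
    """
    return {
        "Authorization": f"Bearer {openai_api_key}",
        "OpenAI-Beta": "realtime=v1",
    }
```

With a client such as the third-party websockets package, these headers would typically be passed at connect time (the keyword is `additional_headers` in recent versions, `extra_headers` in older ones).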
An ElevenLabs API key enables a broader selection of voices for the LLM's spoken responses. Like the OpenAI key, it is stored as a Langflow global variable when set through the Playground UI.
| Component | Role in Voice Mode |
|---|---|
| Chat Input | Entry point for the transcribed or tool-invoked text |
| Language Model | Processes the input and generates a response |
| Chat Output | Emits the response back to the voice mode endpoint |
When an Agent component is present in the flow, the following constraints apply:

- Tools must have accurate name and description fields (used by the model for tool selection).
- Voice mode overrides any text in the Agent Instructions field.

Sources: docs/docs/Develop/concepts-voice-mode.mdx 15-30, docs/docs/Develop/concepts-voice-mode.mdx 38-62
The Langflow Playground uses the /ws/flow_tts/{flow_id} endpoint internally.
Sources: docs/docs/Develop/concepts-voice-mode.mdx 33-62
The Langflow voice WebSocket endpoints are designed to be API-compatible with the OpenAI Realtime API WebSocket interface. Any client library or code that works against OpenAI Realtime WebSockets can be pointed at the Langflow endpoints with minimal changes.
Key substitutions when adapting OpenAI Realtime client code:
| OpenAI Realtime | Langflow Equivalent |
|---|---|
| wss://api.openai.com/v1/realtime | ws://<langflow-host>/ws/flow_as_tool/<flow_id> or ws://<langflow-host>/ws/flow_tts/<flow_id> |
| Model parameter | Replaced by flow_id in the URL path |
| Session context | Optionally scoped by session_id path parameter |
Sources: docs/docs/Develop/concepts-voice-mode.mdx 64-68
| Limitation | Details |
|---|---|
| Langflow Desktop | Voice mode is unavailable; requires OSS Python package |
| Session persistence | Voice chat history is discarded when the connection closes |
| Agent Instructions | Text in Agent Instructions is overridden by voice mode when an Agent component is used |
| OpenAI dependency | Voice mode requires an active OpenAI account and API key; no alternative transcription provider is supported in the current implementation |
Sources: docs/docs/Develop/concepts-voice-mode.mdx 8-11, docs/docs/Get-Started/get-started-installation.mdx 23-25, docs/docs/Develop/concepts-voice-mode.mdx 88-94