This page describes Langflow's voice mode feature: its WebSocket API endpoints, the two streaming modes it supports, how it integrates with the OpenAI Realtime API and ElevenLabs, and how to configure voice-enabled flows. For information about the general chat interface and text-based message handling, see Chat Interface. For information about the broader API endpoint surface, see API Endpoints.
Voice mode enables real-time audio interaction with Langflow flows using a microphone and speakers. It is implemented as a pair of WebSocket endpoints that act as a bridge between a browser client (or any WebSocket consumer) and the OpenAI Realtime API. The flow runs on the Langflow backend and is invoked either as a tool (voice-to-voice) or directly after speech transcription (speech-to-text).
Voice mode is not available in Langflow Desktop. It requires the Langflow OSS Python package.
Sources: docs/docs/Develop/concepts-voice-mode.mdx 1-96
| Requirement | Details |
|---|---|
| Langflow OSS | Must be installed as a Python package; Langflow Desktop does not support voice mode |
| Flow structure | Must contain Chat Input, Language Model, and Chat Output components |
| OpenAI API key | Required for all voice mode sessions; used to authenticate with the OpenAI Realtime API |
| ElevenLabs API key | Optional; enables additional voice options for LLM responses |
| Microphone and speakers | Physical or virtual audio devices accessible by the browser |
| Agent flows | If an Agent component is present, tools must have accurate names and descriptions; voice mode overrides any text in the Agent Instructions field |
Sources: docs/docs/Develop/concepts-voice-mode.mdx 15-30
Diagram: Voice Mode WebSocket Data Flow
Sources: docs/docs/Develop/concepts-voice-mode.mdx 64-95
Both are WebSocket endpoints that accept an OpenAI API key for authentication, and both support an optional /{session_id} path segment.
Diagram: Endpoint vs. Strategy Mapping
Sources: docs/docs/Develop/concepts-voice-mode.mdx 73-86
/ws/flow_as_tool/{flow_id} — Voice-to-Voice Streaming

This endpoint connects the client to the OpenAI Realtime voice model. The flow is registered as a callable tool: the OpenAI model listens to incoming audio and decides autonomously when to invoke the Langflow flow.
/ws/flow_tts/{flow_id} — Speech-to-Text Transcription

This endpoint uses OpenAI Realtime voice transcription to convert incoming audio to text. Each completed transcript is passed directly to the Langflow flow as input, and the flow's response is returned to the client (optionally via ElevenLabs TTS).
Sources: docs/docs/Develop/concepts-voice-mode.mdx 73-86
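The two endpoints and their streaming strategies can be captured in a small lookup, shown here as an illustrative sketch. The two paths come from the documentation above; the dictionary and helper function are not part of Langflow itself.

```python
# Endpoint-to-strategy mapping for Langflow voice mode (illustrative sketch).
VOICE_ENDPOINTS = {
    # Voice-to-voice: the OpenAI Realtime model calls the flow as a tool.
    "flow_as_tool": "/ws/flow_as_tool/{flow_id}",
    # Speech-to-text: each completed transcript is passed to the flow as input.
    "flow_tts": "/ws/flow_tts/{flow_id}",
}


def endpoint_path(mode: str, flow_id: str) -> str:
    """Return the WebSocket path for the chosen streaming mode."""
    return VOICE_ENDPOINTS[mode].format(flow_id=flow_id)


print(endpoint_path("flow_as_tool", "my-flow-id"))  # → /ws/flow_as_tool/my-flow-id
```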
Both endpoints accept an optional /{session_id} path parameter appended after the flow_id:
/ws/flow_as_tool/{flow_id}/{session_id}
/ws/flow_tts/{flow_id}/{session_id}
| Scenario | Behavior |
|---|---|
| session_id provided | Langflow uses the given value as the conversation session ID |
| session_id omitted | Langflow falls back to using the flow_id as the session ID |
| Session ends (Playground closed, connection dropped) | Verbal chat history for that session is discarded and not persisted for future sessions |
Voice mode maintains context only within the current WebSocket connection lifetime. Chat history from voice sessions is not stored in the standard message persistence layer.
Sources: docs/docs/Develop/concepts-voice-mode.mdx 88-94
Diagram: Voice Mode Integration Overview
Sources: docs/docs/Develop/concepts-voice-mode.mdx 64-95
An OpenAI API key must be supplied when establishing a voice mode WebSocket session. In the Playground, this key is entered via the Voice mode dialog and saved as a Langflow global variable.
When building applications against the WebSocket endpoints directly, the OpenAI API key must be provided as part of the connection handshake (following the same pattern as the OpenAI Realtime API WebSocket connection).
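The OpenAI Realtime WebSocket API authenticates with a Bearer token and an "OpenAI-Beta: realtime=v1" header. A minimal sketch of building those headers is shown below; whether Langflow expects the key in these exact headers or elsewhere in the handshake is an assumption — verify against the endpoint implementation before relying on it.

```python
def realtime_handshake_headers(openai_api_key: str) -> dict[str, str]:
    """Headers following the OpenAI Realtime WebSocket connection pattern.

    Assumption: Langflow's voice endpoints accept the key the same way;
    this is a sketch, not Langflow's documented contract.
    """
    return {
        "Authorization": f"Bearer {openai_api_key}",
        "OpenAI-Beta": "realtime=v1",
    }
```

With a client such as the third-party websockets package, these headers would typically be passed at connect time (the keyword is `additional_headers` in recent versions, `extra_headers` in older ones).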
An ElevenLabs API key enables a broader selection of voices for the LLM's spoken responses. Like the OpenAI key, it is stored as a Langflow global variable when set through the Playground UI.
| Component | Role in Voice Mode |
|---|---|
| Chat Input | Entry point for the transcribed or tool-invoked text |
| Language Model | Processes the input and generates a response |
| Chat Output | Emits the response back to the voice mode endpoint |
When an Agent component is present in the flow, the following constraints apply:

- Tools must have accurate name and description fields (used by the model for tool selection).
- Voice mode overrides any text in the Agent Instructions field.

Sources: docs/docs/Develop/concepts-voice-mode.mdx 15-30, docs/docs/Develop/concepts-voice-mode.mdx 38-62
The Langflow Playground uses the /ws/flow_tts/{flow_id} endpoint internally.
Sources: docs/docs/Develop/concepts-voice-mode.mdx 33-62
The Langflow voice WebSocket endpoints are designed to be API-compatible with the OpenAI Realtime API WebSocket interface. Any client library or code that works against OpenAI Realtime WebSockets can be pointed at the Langflow endpoints with minimal changes.
Key substitutions when adapting OpenAI Realtime client code:
| OpenAI Realtime | Langflow Equivalent |
|---|---|
| wss://api.openai.com/v1/realtime | ws://<langflow-host>/ws/flow_as_tool/<flow_id> or ws://<langflow-host>/ws/flow_tts/<flow_id> |
| Model parameter | Replaced by flow_id in the URL path |
| Session context | Optionally scoped by session_id path parameter |
Sources: docs/docs/Develop/concepts-voice-mode.mdx 64-68
| Limitation | Details |
|---|---|
| Langflow Desktop | Voice mode is unavailable; requires OSS Python package |
| Session persistence | Voice chat history is discarded when the connection closes |
| Agent Instructions | Text in Agent Instructions is overridden by voice mode when an Agent component is used |
| OpenAI dependency | Voice mode requires an active OpenAI account and API key; no alternative transcription provider is supported in the current implementation |
Sources: docs/docs/Develop/concepts-voice-mode.mdx 8-11, docs/docs/Get-Started/get-started-installation.mdx 23-25, docs/docs/Develop/concepts-voice-mode.mdx 88-94