This page documents how RAGFlow generates responses from retrieved document chunks and embeds citations to source materials. After the retrieval and reranking pipeline produces relevant chunks (9.3), the system must format these chunks for LLM consumption, generate a coherent answer, insert citations linking answer segments to source chunks, and structure reference metadata for display.
For information about the preceding retrieval and ranking steps, see Reranking and Filtering. For details on how LLMs are invoked, see LLMBundle and Model Types.
The response generation pipeline transforms ranked chunks into a cited answer through five key stages:
Sources: api/db/services/dialog_service.py411-680
Before response generation begins, the system retrieves chunks via the hybrid search mechanism. The retrieval result contains:
| Field | Type | Description |
|---|---|---|
| total | int | Total number of matching chunks |
| ids | list[str] | Chunk IDs in ranked order |
| field | dict | Chunk content and metadata |
| highlight | dict | Highlighted keywords in content |
| aggregation | list/dict | Document-level aggregations |
Key Data Structure: The retrieval returns a kbinfos dictionary with two main components:
Sources: api/db/services/dialog_service.py516-680 rag/nlp/search.py42-51 rag/nlp/search.py158-168
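As a rough illustration of the shape described above, the retrieval result can be sketched as a small builder. The key names used here (chunks, doc_aggs, content_with_weight, chunk_id) are assumptions based on common RAGFlow conventions, not a verbatim copy of the real structure:

```python
def make_kbinfos(pairs):
    """Build a kbinfos-style dict from (content, doc_id) pairs.

    A minimal sketch; key names are assumptions, not the exact schema.
    """
    return {
        "total": len(pairs),
        "chunks": [
            {"chunk_id": str(i), "content_with_weight": text, "doc_id": doc}
            for i, (text, doc) in enumerate(pairs)
        ],
        "doc_aggs": [],  # document-level aggregations
    }
```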
The chunks_format function transforms raw chunks into a structured knowledge string for LLM consumption. This formatting is critical for enabling the LLM to generate proper citations.
Chunks are formatted with numeric IDs that the LLM can reference in its answer:
Format Examples:
For quote=True (default):
[ID: 0] First chunk content here...
[ID: 1] Second chunk content here...
[ID: 2] Third chunk content here...
For quote=False:
0. First chunk content here...
1. Second chunk content here...
2. Third chunk content here...
The function handles token limits by truncating chunks to fit within max_tokens (typically determined by LLM context window minus prompt overhead).
Sources: api/db/services/dialog_service.py555-650 rag/prompts/generator.py (referenced via import)
The complete prompt sent to the LLM consists of three parts:
The system prompt is defined in dialog.prompt_config.system and contains a {knowledge} placeholder:
The {knowledge} placeholder is replaced with the formatted chunks string via kb_prompt() function.
The prompt explicitly instructs the LLM to use citation markers:
- [ID: n] where n is the chunk index
- [ID: 0][ID: 3] for multiple chunks

Sources: api/db/services/dialog_service.py555-620 rag/prompts/generator.py (kb_prompt, citation_prompt functions)
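The three-part prompt assembly can be sketched as below. The helper name assemble_prompt and the exact message layout are assumptions for illustration; the real logic lives in kb_prompt() and citation_prompt() in rag/prompts/generator.py:

```python
def assemble_prompt(system_template, knowledge, citation_instructions, question):
    """Assemble system prompt + citation instructions + user question.

    Sketch: substitutes the {knowledge} placeholder and appends the
    citation instructions to the system message (an assumed layout).
    """
    system = system_template.replace("{knowledge}", knowledge)
    return [
        {"role": "system", "content": system + "\n\n" + citation_instructions},
        {"role": "user", "content": question},
    ]
```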
For non-streaming responses, the system calls chat_mdl.async_chat() with the assembled prompt:
Flow:
Sources: api/db/services/dialog_service.py555-680 rag/llm/chat_model.py473-485
For streaming responses, the system uses chat_mdl.async_chat_streamly() which yields answer deltas:
Streaming Behavior:

- Answer deltas are yielded incrementally as the LLM generates tokens
- Model reasoning emitted inside <think> tags is handled separately from the answer text

Sources: api/db/services/dialog_service.py625-680 rag/llm/chat_model.py173-194
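The delta accumulation and <think>-tag handling can be sketched as a small helper. This is an assumption about the stripping behavior, not the exact code path in chat_model.py:

```python
import re

def accumulate_stream(deltas):
    """Join streamed answer deltas and drop <think>...</think> spans.

    Sketch: assumes reasoning arrives wrapped in <think> tags and is
    removed from the user-visible answer.
    """
    full = "".join(deltas)
    return re.sub(r"<think>.*?</think>", "", full, flags=re.DOTALL).strip()
```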
The expected citation format is [ID: n] where n is the chunk index. However, LLMs may generate malformed citations that need repair.
The repair_bad_citation_formats() function normalizes various citation patterns:
Repair Process:
Examples:
| Original | Repaired | Valid |
|---|---|---|
| (ID: 5) | [ID:5] | ✓ if 5 < len(chunks) |
| 【ID: 3】 | [ID:3] | ✓ if 3 < len(chunks) |
| REF 7 | [ID:7] | ✓ if 7 < len(chunks) |
| [ID: 999] | (unchanged) | ✗ out of bounds |
Validation: Only chunk IDs that exist in kbinfos["chunks"] are added to the citation index set.
Sources: api/db/services/dialog_service.py374-408
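The repair-and-validate behavior from the table above can be sketched with a few regex passes. The patterns and the returned (text, cited-set) pair are assumptions modeled on the examples, not the exact logic of repair_bad_citation_formats():

```python
import re

def repair_citations(answer, n_chunks):
    """Normalize malformed citation patterns to [ID:n].

    Sketch: rewrites (ID: n), 【ID: n】, and REF n to [ID:n]; IDs out of
    range are left unchanged. Returns the text and the set of cited indices.
    """
    cited = set()

    def sub(m):
        i = int(m.group(1))
        if i < n_chunks:
            cited.add(i)
            return f"[ID:{i}]"
        return m.group(0)  # out of bounds: leave untouched

    patterns = [r"\[ID:\s*(\d+)\]", r"\(ID:\s*(\d+)\)", r"【ID:\s*(\d+)】", r"REF\s+(\d+)"]
    for p in patterns:
        answer = re.sub(p, sub, answer)
    return answer, cited
```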
After citation repair, the system extracts quoted chunks to build the reference structure.
As citations are repaired, their indices are collected into an idx set:
The final reference structure contains only chunks that were cited:
Reference Dictionary Structure:
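The cited-only filtering can be sketched as below. The doc_aggs shape and key names are assumptions; the real construction is in dialog_service.py:

```python
def build_reference(kbinfos, cited_idx):
    """Keep only cited chunks (and their documents) in the reference.

    Sketch: filters kbinfos["chunks"] by the collected index set and
    restricts doc_aggs to documents that still have a cited chunk.
    """
    chunks = [c for i, c in enumerate(kbinfos["chunks"]) if i in cited_idx]
    doc_ids = {c["doc_id"] for c in chunks}
    return {
        "total": len(chunks),
        "chunks": chunks,
        "doc_aggs": [d for d in kbinfos.get("doc_aggs", []) if d.get("doc_id") in doc_ids],
    }
```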
Sources: api/db/services/dialog_service.py645-680 api/apps/sdk/session.py365-450
When include_reference_metadata=True is specified, the system enriches each chunk with document-level metadata.
Document metadata is stored in Elasticsearch/Infinity in the ragflow_doc_meta_{tenant_id} index:
Field Filtering: If reference_metadata.fields is specified, only those fields are included:
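The field filtering can be sketched as a simple dictionary restriction (an illustration of the described behavior, not the service's exact code):

```python
def filter_metadata(meta, fields=None):
    """Apply reference_metadata.fields-style filtering.

    Sketch: keep only the requested keys; with no filter, return a copy
    of the full metadata.
    """
    if not fields:
        return dict(meta)
    return {k: v for k, v in meta.items() if k in fields}
```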
Sources: api/apps/sdk/session.py273-281 api/apps/sdk/session.py365-450 api/db/services/doc_metadata_service.py
The non-streaming mode returns a complete response object with answer and reference:
Code Path: api/db/services/dialog_service.py625-680 api/apps/sdk/session.py453-481
The streaming mode uses Server-Sent Events to stream answer deltas:
Event Sequence:
Stream Format:
data: {"answer": "The ", "reference": {}, "final": false}
data: {"answer": "revenue ", "reference": {}, "final": false}
data: {"answer": "was $1.2B [ID:0]", "reference": {}, "final": false}
data: {"answer": "", "reference": {"total": 1, "chunks": [...]}, "final": true}
Note: Reference structure appears only in the final chunk when final: true.
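A producer for the stream format above can be sketched as a generator. The function name sse_events is hypothetical; the event payload mirrors the documented shape:

```python
import json

def sse_events(deltas, reference):
    """Yield SSE data lines: answer deltas, then a final reference event.

    Sketch: each delta carries an empty reference; only the last event
    sets final=true and includes the reference structure.
    """
    for d in deltas:
        yield "data: " + json.dumps({"answer": d, "reference": {}, "final": False})
    yield "data: " + json.dumps({"answer": "", "reference": reference, "final": True})
```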
Sources: api/apps/sdk/session.py332-416 api/db/services/conversation_service.py150-200
RAGFlow provides an OpenAI-compatible API endpoint that formats responses to match OpenAI's chat completion schema.
Note: In streaming mode, the reference field is currently not included in the OpenAI-compatible format. Use non-streaming mode with with_raw_response to retrieve references.
Sources: api/apps/sdk/session.py180-450
The frontend uses citation markers to link answer segments to source chunks.
The UI parses ##n$$ markers to create clickable references:
Answer text ##0$$ more text ##1$$##2$$ conclusion.
Maps to:
- ##0$$ → chunk at index 0
- ##1$$ → chunk at index 1
- ##2$$ → chunk at index 2

The rmPrefix() function converts [ID:n] citations to ##n$$ markers:
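The marker conversion can be sketched as a single regex substitution (an illustration of the rmPrefix-style transformation, not its exact implementation):

```python
import re

def to_ui_markers(answer):
    """Convert [ID:n] citations to the ##n$$ markers the UI parses.

    Sketch: tolerates optional whitespace after the colon, as seen in
    model output.
    """
    return re.sub(r"\[ID:\s*(\d+)\]", lambda m: f"##{m.group(1)}$$", answer)
```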
Flow:
Sources: rag/app/qa.py (rmPrefix function), api/apps/sdk/session.py365-450
When no chunks are retrieved (kbinfos["total"] == 0), the system falls back to chat-only mode:
Sources: api/db/services/dialog_service.py414-417
If the LLM response is truncated due to token limits:
Sources: rag/llm/chat_model.py59-60 rag/llm/chat_model.py166-170
Citations referencing non-existent chunks are preserved but not tracked:
This prevents errors while maintaining answer integrity.
Sources: api/db/services/dialog_service.py382-408
The response generation and citation system implements a complete pipeline: chunks are formatted with [ID:n] markers, the LLM is instructed to cite them, malformed citations are repaired, and the reference structure is filtered down to cited chunks. This architecture ensures that every statement in generated answers can be traced back to specific source chunks, enabling transparent and verifiable RAG responses.