This page documents how RAGFlow generates responses from retrieved document chunks and embeds citations to source materials. After the retrieval and reranking pipeline produces relevant chunks (9.3), the system must format these chunks for LLM consumption, generate a coherent answer, insert citations linking answer segments to source chunks, and structure reference metadata for display.
For information about the preceding retrieval and ranking steps, see Reranking and Filtering. For details on how LLMs are invoked, see LLMBundle and Model Types.
The response generation pipeline transforms ranked chunks into a cited answer through five key stages:
Sources: api/db/services/dialog_service.py411-680
Before response generation begins, the system retrieves chunks via the hybrid search mechanism. The retrieval result contains:
| Field | Type | Description |
|---|---|---|
| total | int | Total number of matching chunks |
| ids | list[str] | Chunk IDs in ranked order |
| field | dict | Chunk content and metadata |
| highlight | dict | Highlighted keywords in content |
| aggregation | list/dict | Document-level aggregations |
Key Data Structure: The retrieval returns a kbinfos dictionary with two main components:
Sources: api/db/services/dialog_service.py516-680 rag/nlp/search.py42-51 rag/nlp/search.py158-168
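As a rough illustration of the shape described above, the retrieval result can be sketched as a small builder. The key names used here (chunks, doc_aggs, content_with_weight, chunk_id) are assumptions based on common RAGFlow conventions, not a verbatim copy of the real structure:

```python
def make_kbinfos(pairs):
    """Build a kbinfos-style dict from (content, doc_id) pairs.

    A minimal sketch; key names are assumptions, not the exact schema.
    """
    return {
        "total": len(pairs),
        "chunks": [
            {"chunk_id": str(i), "content_with_weight": text, "doc_id": doc}
            for i, (text, doc) in enumerate(pairs)
        ],
        "doc_aggs": [],  # document-level aggregations
    }
```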
The chunks_format function transforms raw chunks into a structured knowledge string for LLM consumption. This formatting is critical for enabling the LLM to generate proper citations.
Chunks are formatted with numeric IDs that the LLM can reference in its answer:
Format Examples:
For quote=True (default):
[ID: 0] First chunk content here...
[ID: 1] Second chunk content here...
[ID: 2] Third chunk content here...
For quote=False:
0. First chunk content here...
1. Second chunk content here...
2. Third chunk content here...
The function handles token limits by truncating chunks to fit within max_tokens (typically determined by LLM context window minus prompt overhead).
Sources: api/db/services/dialog_service.py555-650 rag/prompts/generator.py (referenced via import)
The complete prompt sent to the LLM consists of three parts:
The system prompt is defined in dialog.prompt_config.system and contains a {knowledge} placeholder:
The {knowledge} placeholder is replaced with the formatted chunks string via kb_prompt() function.
The prompt explicitly instructs the LLM to use citation markers:
- [ID: n] where n is the chunk index
- [ID: 0][ID: 3] for multiple chunks

Sources: api/db/services/dialog_service.py555-620 rag/prompts/generator.py (kb_prompt, citation_prompt functions)
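The three-part prompt assembly can be sketched as below. The helper name assemble_prompt and the exact message layout are assumptions for illustration; the real logic lives in kb_prompt() and citation_prompt() in rag/prompts/generator.py:

```python
def assemble_prompt(system_template, knowledge, citation_instructions, question):
    """Assemble system prompt + citation instructions + user question.

    Sketch: substitutes the {knowledge} placeholder and appends the
    citation instructions to the system message (an assumed layout).
    """
    system = system_template.replace("{knowledge}", knowledge)
    return [
        {"role": "system", "content": system + "\n\n" + citation_instructions},
        {"role": "user", "content": question},
    ]
```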
For non-streaming responses, the system calls chat_mdl.async_chat() with the assembled prompt:
Flow:
Sources: api/db/services/dialog_service.py555-680 rag/llm/chat_model.py473-485
For streaming responses, the system uses chat_mdl.async_chat_streamly() which yields answer deltas:
Streaming Behavior:

- Answer deltas are yielded incrementally as the LLM generates tokens
- Model reasoning emitted inside <think> tags is handled separately from the answer text

Sources: api/db/services/dialog_service.py625-680 rag/llm/chat_model.py173-194
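The delta accumulation and <think>-tag handling can be sketched as a small helper. This is an assumption about the stripping behavior, not the exact code path in chat_model.py:

```python
import re

def accumulate_stream(deltas):
    """Join streamed answer deltas and drop <think>...</think> spans.

    Sketch: assumes reasoning arrives wrapped in <think> tags and is
    removed from the user-visible answer.
    """
    full = "".join(deltas)
    return re.sub(r"<think>.*?</think>", "", full, flags=re.DOTALL).strip()
```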
The expected citation format is [ID: n] where n is the chunk index. However, LLMs may generate malformed citations that need repair.
The repair_bad_citation_formats() function normalizes various citation patterns:
Repair Process:
Examples:
| Original | Repaired | Valid |
|---|---|---|
| (ID: 5) | [ID:5] | ✓ if 5 < len(chunks) |
| 【ID: 3】 | [ID:3] | ✓ if 3 < len(chunks) |
| REF 7 | [ID:7] | ✓ if 7 < len(chunks) |
| [ID: 999] | (unchanged) | ✗ out of bounds |
Validation: Only chunk IDs that exist in kbinfos["chunks"] are added to the citation index set.
Sources: api/db/services/dialog_service.py374-408
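The repair-and-validate behavior from the table above can be sketched with a few regex passes. The patterns and the returned (text, cited-set) pair are assumptions modeled on the examples, not the exact logic of repair_bad_citation_formats():

```python
import re

def repair_citations(answer, n_chunks):
    """Normalize malformed citation patterns to [ID:n].

    Sketch: rewrites (ID: n), 【ID: n】, and REF n to [ID:n]; IDs out of
    range are left unchanged. Returns the text and the set of cited indices.
    """
    cited = set()

    def sub(m):
        i = int(m.group(1))
        if i < n_chunks:
            cited.add(i)
            return f"[ID:{i}]"
        return m.group(0)  # out of bounds: leave untouched

    patterns = [r"\[ID:\s*(\d+)\]", r"\(ID:\s*(\d+)\)", r"【ID:\s*(\d+)】", r"REF\s+(\d+)"]
    for p in patterns:
        answer = re.sub(p, sub, answer)
    return answer, cited
```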
After citation repair, the system extracts quoted chunks to build the reference structure.
As citations are repaired, their indices are collected into an idx set:
The final reference structure contains only chunks that were cited:
Reference Dictionary Structure:
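The cited-only filtering can be sketched as below. The doc_aggs shape and key names are assumptions; the real construction is in dialog_service.py:

```python
def build_reference(kbinfos, cited_idx):
    """Keep only cited chunks (and their documents) in the reference.

    Sketch: filters kbinfos["chunks"] by the collected index set and
    restricts doc_aggs to documents that still have a cited chunk.
    """
    chunks = [c for i, c in enumerate(kbinfos["chunks"]) if i in cited_idx]
    doc_ids = {c["doc_id"] for c in chunks}
    return {
        "total": len(chunks),
        "chunks": chunks,
        "doc_aggs": [d for d in kbinfos.get("doc_aggs", []) if d.get("doc_id") in doc_ids],
    }
```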
Sources: api/db/services/dialog_service.py645-680 api/apps/sdk/session.py365-450
When include_reference_metadata=True is specified, the system enriches each chunk with document-level metadata.
Document metadata is stored in Elasticsearch/Infinity in the ragflow_doc_meta_{tenant_id} index:
Field Filtering: If reference_metadata.fields is specified, only those fields are included:
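The field filtering can be sketched as a simple dictionary restriction (an illustration of the described behavior, not the service's exact code):

```python
def filter_metadata(meta, fields=None):
    """Apply reference_metadata.fields-style filtering.

    Sketch: keep only the requested keys; with no filter, return a copy
    of the full metadata.
    """
    if not fields:
        return dict(meta)
    return {k: v for k, v in meta.items() if k in fields}
```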
Sources: api/apps/sdk/session.py273-281 api/apps/sdk/session.py365-450 api/db/services/doc_metadata_service.py
The non-streaming mode returns a complete response object with answer and reference:
Code Path: api/db/services/dialog_service.py625-680 api/apps/sdk/session.py453-481
The streaming mode uses Server-Sent Events to stream answer deltas:
Event Sequence:
Stream Format:
data: {"answer": "The ", "reference": {}, "final": false}
data: {"answer": "revenue ", "reference": {}, "final": false}
data: {"answer": "was $1.2B [ID:0]", "reference": {}, "final": false}
data: {"answer": "", "reference": {"total": 1, "chunks": [...]}, "final": true}
Note: Reference structure appears only in the final chunk when final: true.
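A producer for the stream format above can be sketched as a generator. The function name sse_events is hypothetical; the event payload mirrors the documented shape:

```python
import json

def sse_events(deltas, reference):
    """Yield SSE data lines: answer deltas, then a final reference event.

    Sketch: each delta carries an empty reference; only the last event
    sets final=true and includes the reference structure.
    """
    for d in deltas:
        yield "data: " + json.dumps({"answer": d, "reference": {}, "final": False})
    yield "data: " + json.dumps({"answer": "", "reference": reference, "final": True})
```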
Sources: api/apps/sdk/session.py332-416 api/db/services/conversation_service.py150-200
RAGFlow provides an OpenAI-compatible API endpoint that formats responses to match OpenAI's chat completion schema.
Note: In streaming mode, the reference field is currently not included in the OpenAI-compatible format. Use non-streaming mode with with_raw_response to retrieve references.
Sources: api/apps/sdk/session.py180-450
The frontend uses citation markers to link answer segments to source chunks.
The UI parses ##n$$ markers to create clickable references:
Answer text ##0$$ more text ##1$$##2$$ conclusion.
Maps to:
- ##0$$ → chunk at index 0
- ##1$$ → chunk at index 1
- ##2$$ → chunk at index 2

The rmPrefix() function converts [ID:n] citations to ##n$$ markers:
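The marker conversion can be sketched as a single regex substitution (an illustration of the rmPrefix-style transformation, not its exact implementation):

```python
import re

def to_ui_markers(answer):
    """Convert [ID:n] citations to the ##n$$ markers the UI parses.

    Sketch: tolerates optional whitespace after the colon, as seen in
    model output.
    """
    return re.sub(r"\[ID:\s*(\d+)\]", lambda m: f"##{m.group(1)}$$", answer)
```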
Flow:
Sources: rag/app/qa.py (rmPrefix function), api/apps/sdk/session.py365-450
When no chunks are retrieved (kbinfos["total"] == 0), the system falls back to chat-only mode:
Sources: api/db/services/dialog_service.py414-417
If the LLM response is truncated due to token limits:
Sources: rag/llm/chat_model.py59-60 rag/llm/chat_model.py166-170
Citations referencing non-existent chunks are preserved but not tracked:
This prevents errors while maintaining answer integrity.
Sources: api/db/services/dialog_service.py382-408
The response generation and citation system implements a complete pipeline: chunks are formatted with [ID:n] markers, the LLM is instructed to cite them, malformed citations are repaired, and the reference structure is filtered down to cited chunks. This architecture ensures that every statement in generated answers can be traced back to specific source chunks, enabling transparent and verifiable RAG responses.