This page documents the bucket system for file uploads, document processing, and context injection into chat messages. The bucket tool allows users to upload files (PDFs, documents, spreadsheets, etc.) that are then processed and injected as context into AI conversations.
For information about tool orchestration and execution, see Tool Orchestration. For web search integration, see Web Search Integration.
The bucket system provides temporary file storage with automatic text extraction and context injection. Files uploaded to a bucket are processed into markdown/text format, cached for performance, and can be referenced in chat messages via a special JSON notation that gets replaced with the actual file content.
Key capabilities:

- Temporary per-bucket file storage with automatic text extraction
- Conversion of uploaded documents to markdown/text, with caching for performance
- Context injection into chat messages via a JSON reference notation
- External URL downloading with optional recursive link following
- Optional spaCy-based refinement of extracted text
Sources: g4f/tools/files.py1-600 g4f/tools/run_tools.py32-110
Diagram: Bucket System Data Flow
The system processes uploads through format-specific extractors, caches the results, and injects content into messages when referenced.
Sources: g4f/tools/files.py88-320 g4f/tools/run_tools.py88-110 g4f/gui/server/backend_api.py386-485
Each bucket is stored in a unique directory under ./har_and_cookies/buckets/:
```
./har_and_cookies/buckets/<bucket_id>/
├── files.txt            # List of uploaded filenames
├── downloads.json       # URLs to download (optional)
├── plain.cache          # Full processed text cache
├── plain_0001.cache     # Text chunks (if split)
├── plain_0002.cache
├── spacy_0001.cache     # Refined chunks (if spacy used)
├── document.pdf         # Original uploaded files
├── document.pdf.md      # Markdown conversion
├── spreadsheet.xlsx
├── media/
│   ├── image1.jpg       # Uploaded images
│   └── image2.png
└── thumbnail/
    ├── image1.jpg       # Generated thumbnails
    └── image2.png
```
Key files:
- files.txt: Newline-separated list of files to process (excludes media)
- downloads.json: Array of {"urls": [...], "max_depth": 0} objects
- plain.cache: Combined text from all processed files
- plain_NNNN.cache: Text split into ~1MB chunks for large datasets
- spacy_NNNN.cache: NLP-refined chunks (optional)

Sources: g4f/tools/files.py84-87 g4f/files.py11-31 g4f/gui/server/backend_api.py413-485
Diagram: File Upload Sequence
Sources: g4f/gui/server/backend_api.py413-485
| Format | Extensions | Libraries | Description |
|---|---|---|---|
| PDF | .pdf | PyPDF2, pdfplumber, pdfminer | Text extraction with fallback chain |
| Word | .docx | python-docx, docx2txt | Paragraph extraction |
| OpenDocument | .odt | odfpy | ODF text document support |
| EPUB | .epub | ebooklib | E-book document extraction |
| Excel | .xlsx | pandas, openpyxl | Row-wise text concatenation |
| HTML | .html | BeautifulSoup4 | Scraped text content |
| Zip | .zip | zipfile | Recursive extraction and processing |
| Plain Text | .txt, .md, .json, .py, .js, .xml, etc. | Built-in | Direct text read (see PLAIN_FILE_EXTENSIONS) |
Sources: g4f/tools/files.py84-123 g4f/tools/files.py149-213
The stream_read_files() function processes each file type:
- PDF processing: fallback chain across PyPDF2, pdfplumber, and pdfminer
- DOCX processing: paragraph extraction via python-docx or docx2txt
- XLSX processing: row-wise text concatenation via pandas/openpyxl
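A condensed sketch of that per-format dispatch (illustrative only: the real stream_read_files() streams chunks, supports more formats, and walks its PDF fallback chain; here only one extractor per format is shown):

```python
import os

def extract_text(path: str) -> str:
    """Dispatch text extraction by file extension, in the spirit of
    stream_read_files(). Simplified: one extractor per format, no streaming."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        # Real code tries PyPDF2, then pdfplumber, then pdfminer
        from pdfminer.high_level import extract_text as pdf_extract
        return pdf_extract(path)
    if ext == ".docx":
        from docx import Document  # python-docx
        return "\n".join(p.text for p in Document(path).paragraphs)
    if ext == ".xlsx":
        import pandas as pd
        df = pd.read_excel(path)
        # Row-wise concatenation, as the format table describes
        return "\n".join(" ".join(str(v) for v in row)
                         for row in df.itertuples(index=False))
    # Plain-text formats (.txt, .md, .json, ...) are read directly
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read()
```

The heavy imports are deferred into each branch so a missing optional dependency only fails for the format that needs it.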
Each file is wrapped with metadata markers:
```
<!-- File: document.pdf -->
[extracted text content]
<-- End -->
```
Sources: g4f/tools/files.py149-213
For comprehensive document conversion, MarkItDown is used as a universal converter during upload:
This handles formats like PPTX, images with OCR, audio transcription, and more.
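A minimal sketch of that conversion step using MarkItDown's Python API (the wrapper function name is ours; the actual call site is in upload_files()):

```python
def to_markdown(path: str) -> str:
    """Convert a document to markdown text via MarkItDown.
    Hypothetical wrapper around the library call used during upload."""
    from markitdown import MarkItDown  # lazy import: optional dependency
    result = MarkItDown().convert(path)
    return result.text_content
```

The resulting text is what gets written next to the original upload as `<name>.md` (e.g. `document.pdf.md`).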
Sources: g4f/gui/server/backend_api.py429-446
The bucket tool is invoked during message processing to inject file content. It operates on the bucket_tool function name defined in TOOL_NAMES:
Step 1: Detect Bucket References
Messages are scanned for the pattern {"bucket_id": "xxx"}:
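A sketch of that scan (the regex is ours, built from the documented notation; the real pattern lives in run_tools.py):

```python
import re

# Matches the documented reference notation {"bucket_id": "xxx"}
BUCKET_PATTERN = re.compile(r'\{"bucket_id":\s*"([^"]+)"\}')

def find_bucket_ids(message: str) -> list:
    """Return every bucket id referenced in a message."""
    return BUCKET_PATTERN.findall(message)
```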
Step 2: Replace with Content
The read_bucket() function retrieves cached text:
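An illustrative version of that read, assuming the cache layout documented above (spaCy chunks prioritized over plain chunks, with the combined plain.cache as fallback; exact precedence rules are in files.py):

```python
import os

def _read(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

def read_bucket_sketch(bucket_dir: str) -> str:
    """Concatenate cached chunks: spacy_NNNN.cache wins over
    plain_NNNN.cache; fall back to the combined plain.cache."""
    parts, i = [], 1
    while True:
        spacy_f = os.path.join(bucket_dir, f"spacy_{i:04d}.cache")
        plain_f = os.path.join(bucket_dir, f"plain_{i:04d}.cache")
        if os.path.exists(spacy_f):
            parts.append(_read(spacy_f))
        elif os.path.exists(plain_f):
            parts.append(_read(plain_f))
        else:
            break
        i += 1
    if not parts:  # no chunked caches: use the combined cache
        whole = os.path.join(bucket_dir, "plain.cache")
        if os.path.exists(whole):
            parts.append(_read(whole))
    return "".join(parts)
```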
Step 3: Add Citation Instructions
If the message contains "Source:" markers (from URL downloads or file metadata), citation instructions are appended:
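That conditional append can be sketched as follows (the instruction text here is a placeholder; the real wording is BUCKET_INSTRUCTIONS in run_tools.py):

```python
# Placeholder instruction text - an assumption, not the real constant
CITATION_NOTE = "Cite the sources you used in your answer."

def maybe_add_citations(message: str) -> str:
    """Append citation instructions only when the injected content
    carries 'Source:' markers, mirroring Step 3."""
    if "Source:" in message and CITATION_NOTE not in message:
        return message + "\n\n" + CITATION_NOTE
    return message
```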
Sources: g4f/tools/run_tools.py32-110
For example, a user message containing the reference notation {"bucket_id": "..."} is rewritten so that the notation is replaced with the bucket's cached text before the message reaches the model.
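A minimal sketch of this substitution (the helper name and exact replacement strategy are assumptions; the real logic is ToolHandler.process_bucket_tool()):

```python
def inject_bucket(message: str, bucket_text: str, bucket_id: str) -> str:
    """Replace the documented JSON reference with the cached bucket text."""
    needle = '{"bucket_id": "%s"}' % bucket_id
    return message.replace(needle, bucket_text)

before = 'Summarize this: {"bucket_id": "abc123"}'
after = inject_bucket(before, "<!-- File: notes.txt -->\nhello\n<-- End -->", "abc123")
```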
Sources: g4f/tools/run_tools.py88-110
Route: GET /backend-api/v2/files/<bucket_id>/stream
Returns Server-Sent Events (SSE) with processing status and content chunks.
Event types:
- download: URL download progress
- load: File loading and caching progress
- refine: spaCy refinement progress (if enabled)
- media: Media file notification
- delete_files: Cleanup completion
- done: Final size report
- error: Error message

Query parameters:

- delete_files (default: true): Delete source files after processing
- refine_chunks_with_spacy (default: false): Apply NLP refinement

Sources: g4f/gui/server/backend_api.py386-411
Diagram: Streaming Data Flow
Core streaming function:
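A hedged sketch of the orchestration get_streaming() performs: yield JSON status events while loading and caching bucket content. Event names come from the documented SSE protocol; the real function is asynchronous and covers downloads, refinement, and errors:

```python
import json
import os

def stream_bucket_events(bucket_dir: str, refine: bool = False):
    """Illustrative event generator for bucket processing status.
    Simplified stand-in for the real streaming orchestrator."""
    size = 0
    cache = os.path.join(bucket_dir, "plain.cache")
    if os.path.exists(cache):
        size = os.path.getsize(cache)
        yield json.dumps({"action": "load", "size": size})
    if refine:
        yield json.dumps({"action": "refine", "size": size})
    yield json.dumps({"action": "done", "size": size})
```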
Sources: g4f/tools/files.py559-568
The cache_stream() function implements write-through caching:
- If plain.cache exists, stream directly from the cache
- Otherwise, process the files and write plain.cache after completion

Sources: g4f/tools/files.py215-226
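The write-through pattern can be sketched like this (the temp-file-and-rename step is our assumption for crash safety; the real cache_stream() may differ in detail):

```python
import os
from typing import Iterator

def cache_stream_sketch(chunks: Iterator[str], cache_file: str) -> Iterator[str]:
    """Replay from cache if present; otherwise tee chunks to a temp file
    and rename it on completion so a partial run never leaves a cache."""
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as f:
            yield f.read()
        return
    tmp = cache_file + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(chunk)
            yield chunk  # pass each chunk through while caching it
    os.rename(tmp, cache_file)
```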
The bucket system can download external URLs and add them to the file collection.
Create a downloads.json file in the bucket directory:
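A plausible downloads.json (the array-of-objects shape and the urls/max_depth keys come from the key-files list above; the URL is illustrative):

```json
[
  {
    "urls": ["https://example.com/docs/intro.html"],
    "max_depth": 1
  }
]
```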
Parameters:
- urls or url: List of URLs or single URL to download
- max_depth (default: 0): Recursive link following depth
- delay (default: 3): Seconds between requests
- group_size (default: 5): Parallel download batch size
- timeout (default: 10): Request timeout in seconds

Sources: g4f/tools/files.py490-502
Filename generation:
Sources: g4f/tools/files.py413-488 g4f/tools/files.py321-326
For HTML files, links are extracted for recursive downloading:
Sources: g4f/tools/files.py390-411
When plain.cache exceeds ~1MB, it's automatically split into chunks:
Completeness check (ensures code blocks are not split):
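The idea can be sketched as splitting on newline boundaries while holding a chunk open whenever it contains an unclosed ``` fence (the even-fence-count rule is our stand-in for the real completeness check):

```python
def split_by_size_and_newline(text: str, max_size: int = 1 << 20):
    """Split text into ~max_size chunks on newlines, never splitting
    inside a ``` code block. Illustrative version of
    split_file_by_size_and_newline()."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        current += line
        # Only cut when the chunk is big enough AND all fences are closed
        if len(current) >= max_size and current.count("```") % 2 == 0:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks
```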
Sources: g4f/tools/files.py286-319 g4f/tools/files.py228-229
Optional NLP-based text refinement for better context quality:
Usage: Set refine_chunks_with_spacy=true in the stream request.
The refined text is cached separately in spacy_NNNN.cache files, which are prioritized over plain cache during reads.
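A sketch of the refinement idea: segment into sentences and drop fragments with no alphabetic content. The filtering criterion here is our assumption; the actual refinement logic lives in g4f/tools/files.py, and a naive splitter stands in when the spaCy model is unavailable:

```python
def refine_text(text: str) -> str:
    """Sentence-level filtering sketch of the spaCy refinement step."""
    try:
        import spacy
        nlp = spacy.load("en_core_web_sm")
        sentences = [s.text for s in nlp(text).sents]
    except Exception:  # spacy or en_core_web_sm not installed
        sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    # Drop fragments without alphabetic content (page numbers, rules, ...)
    kept = [s for s in sentences
            if any(t.isalpha() for t in s.replace(".", " ").split())]
    return " ".join(kept)
```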
Sources: g4f/tools/files.py124-140 g4f/tools/files.py262-284
Media files (images, videos) are stored separately from documents:
- Validation: is_allowed_extension(filename)
- Storage: bucket_dir/media/<filename>
- Thumbnails: bucket_dir/thumbnail/<filename>

Route: GET /files/<bucket_id>/<file_type>/<filename>
file_type options:
- media: Full-size media
- thumbnail: Generated thumbnail (with fallback to media)

Query parameters:

- url=<source_url>: Original source URL (encoded)

Sources: g4f/gui/server/backend_api.py487-504 g4f/gui/server/backend_api.py413-485
Endpoint: POST /backend-api/v2/files/<bucket_id>
Request: multipart/form-data with files[] field
Response:
Sources: g4f/gui/server/backend_api.py413-485
Endpoint: GET /backend-api/v2/files/<bucket_id>/stream
Query parameters:
- delete_files: Boolean (default: true)
- refine_chunks_with_spacy: Boolean (default: false)

Response: text/event-stream
```
data: {"action": "download", "count": 1}
data: {"action": "load", "size": 4096}
data: {"action": "media", "filename": "image.jpg"}
data: {"action": "done", "size": 1048576}
```
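A client consuming this stream only needs to strip the SSE `data: ` prefix and parse the JSON payloads; a minimal sketch:

```python
import json

def parse_sse(lines):
    """Parse 'data: {...}' SSE lines into event dicts (client-side
    sketch for the /files/<bucket_id>/stream endpoint)."""
    events = []
    for line in lines:
        if line.startswith("data: "):
            events.append(json.loads(line[len("data: "):]))
    return events
```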
Sources: g4f/gui/server/backend_api.py386-411
Endpoints:
- GET /backend-api/v2/files/<bucket_id>: Stream content (non-SSE)
- DELETE /backend-api/v2/files/<bucket_id>: Delete entire bucket

DELETE response:
Sources: g4f/gui/server/backend_api.py390-405
The bucket tool is invoked via the standard tool calling mechanism:
When bucket_tool is in the tool calls list, ToolHandler.process_bucket_tool() is invoked automatically during message processing in async_iter_run_tools() and iter_run_tools().
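A hedged illustration of the tool-call shape that triggers this path (the field names follow the common OpenAI-style tool-calling schema and are assumptions; only the `bucket_tool` name is confirmed by TOOL_NAMES):

```python
# Hypothetical tool_calls entry passed through run_tools - the exact
# schema is defined by the tool-calling layer, not by this page.
tool_calls = [
    {
        "type": "function",
        "function": {
            "name": "bucket_tool",  # matches TOOL_NAMES
            "arguments": {},
        },
    }
]
```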
Sources: g4f/tools/run_tools.py113-145
Diagram: Complete Bucket Integration Flow
Sources: g4f/tools/run_tools.py236-346 g4f/tools/run_tools.py88-110 g4f/gui/server/api.py178-311
| Entity | Location | Purpose |
|---|---|---|
| get_bucket_dir() | g4f/files.py11-31 | Resolve bucket directory path |
| supports_filename() | g4f/tools/files.py89-122 | Check if file format is supported |
| stream_read_files() | g4f/tools/files.py149-213 | Extract text from various file formats |
| read_bucket() | g4f/tools/files.py246-260 | Read cached bucket content |
| cache_stream() | g4f/tools/files.py215-226 | Write-through cache implementation |
| split_file_by_size_and_newline() | g4f/tools/files.py286-319 | Split large files into chunks |
| download_urls() | g4f/tools/files.py413-488 | Async URL downloading with recursion |
| get_streaming() | g4f/tools/files.py559-568 | Main streaming orchestrator |
| ToolHandler.process_bucket_tool() | g4f/tools/run_tools.py88-110 | Inject bucket content into messages |
| BUCKET_INSTRUCTIONS | g4f/tools/run_tools.py32-34 | Citation format instructions |
| upload_files() | g4f/gui/server/backend_api.py413-485 | File upload endpoint handler |
| manage_files() | g4f/gui/server/backend_api.py390-411 | Streaming/deletion endpoint handler |
Sources: All files referenced in table
Required:
- aiofile: Async file I/O (optional, for concurrent writes)
- aiohttp: HTTP client for URL downloads

Document Processing (at least one per format):

- PyPDF2 or pdfplumber or pdfminer: PDF extraction
- python-docx or docx2txt: DOCX extraction
- odfpy: ODT support
- ebooklib: EPUB support
- pandas + openpyxl: XLSX support
- BeautifulSoup4: HTML scraping

Optional:

- spacy + en_core_web_sm model: Text refinement
- markitdown: Universal document conversion
- Pillow: Image processing and thumbnails

Install all file processing dependencies:
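A plausible one-shot install (hedged: the `files` extra name is an assumption; check the project's packaging metadata for the exact extras):

```shell
pip install -U g4f[files]

# Optional spaCy model for text refinement
python -m spacy download en_core_web_sm
```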
Sources: g4f/tools/files.py17-74
- Processed text is cached in plain.cache to avoid reprocessing on subsequent reads

Sources: g4f/tools/files.py215-319 g4f/tools/files.py413-488