This page documents the bucket system for file uploads, document processing, and context injection into chat messages. The bucket tool allows users to upload files (PDFs, documents, spreadsheets, etc.) that are then processed and injected as context into AI conversations.
For information about tool orchestration and execution, see Tool Orchestration. For web search integration, see Web Search Integration.
The bucket system provides temporary file storage with automatic text extraction and context injection. Files uploaded to a bucket are processed into markdown/text format, cached for performance, and can be referenced in chat messages via a special JSON notation that gets replaced with the actual file content.
Key capabilities:

- Temporary per-bucket file storage with automatic text extraction
- Conversion of uploaded documents to markdown/text, with caching for performance
- Context injection into chat messages via a JSON reference notation
- External URL downloading with optional recursive link following
- Optional spaCy-based refinement of extracted text
Sources: g4f/tools/files.py1-600 g4f/tools/run_tools.py32-110
Diagram: Bucket System Data Flow
The system processes uploads through format-specific extractors, caches the results, and injects content into messages when referenced.
Sources: g4f/tools/files.py88-320 g4f/tools/run_tools.py88-110 g4f/gui/server/backend_api.py386-485
Each bucket is stored in a unique directory under ./har_and_cookies/buckets/:
```
./har_and_cookies/buckets/<bucket_id>/
├── files.txt            # List of uploaded filenames
├── downloads.json       # URLs to download (optional)
├── plain.cache          # Full processed text cache
├── plain_0001.cache     # Text chunks (if split)
├── plain_0002.cache
├── spacy_0001.cache     # Refined chunks (if spacy used)
├── document.pdf         # Original uploaded files
├── document.pdf.md      # Markdown conversion
├── spreadsheet.xlsx
├── media/
│   ├── image1.jpg       # Uploaded images
│   └── image2.png
└── thumbnail/
    ├── image1.jpg       # Generated thumbnails
    └── image2.png
```
Key files:
- files.txt: Newline-separated list of files to process (excludes media)
- downloads.json: Array of {"urls": [...], "max_depth": 0} objects
- plain.cache: Combined text from all processed files
- plain_NNNN.cache: Text split into ~1MB chunks for large datasets
- spacy_NNNN.cache: NLP-refined chunks (optional)

Sources: g4f/tools/files.py84-87 g4f/files.py11-31 g4f/gui/server/backend_api.py413-485
Diagram: File Upload Sequence
Sources: g4f/gui/server/backend_api.py413-485
| Format | Extensions | Libraries | Description |
|---|---|---|---|
| PDF | .pdf | PyPDF2, pdfplumber, pdfminer | Text extraction with fallback chain |
| Word | .docx | python-docx, docx2txt | Paragraph extraction |
| OpenDocument | .odt | odfpy | ODF text document support |
| EPUB | .epub | ebooklib | E-book document extraction |
| Excel | .xlsx | pandas, openpyxl | Row-wise text concatenation |
| HTML | .html | BeautifulSoup4 | Scraped text content |
| Zip | .zip | zipfile | Recursive extraction and processing |
| Plain Text | .txt, .md, .json, .py, .js, .xml, etc. | Built-in | Direct text read (see PLAIN_FILE_EXTENSIONS) |
Sources: g4f/tools/files.py84-123 g4f/tools/files.py149-213
The stream_read_files() function processes each file type:
- PDF processing: fallback chain across PyPDF2, pdfplumber, and pdfminer
- DOCX processing: paragraph extraction via python-docx or docx2txt
- XLSX processing: row-wise text concatenation via pandas/openpyxl
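A condensed sketch of that per-format dispatch (illustrative only: the real stream_read_files() streams chunks, supports more formats, and walks its PDF fallback chain; here only one extractor per format is shown):

```python
import os

def extract_text(path: str) -> str:
    """Dispatch text extraction by file extension, in the spirit of
    stream_read_files(). Simplified: one extractor per format, no streaming."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        # Real code tries PyPDF2, then pdfplumber, then pdfminer
        from pdfminer.high_level import extract_text as pdf_extract
        return pdf_extract(path)
    if ext == ".docx":
        from docx import Document  # python-docx
        return "\n".join(p.text for p in Document(path).paragraphs)
    if ext == ".xlsx":
        import pandas as pd
        df = pd.read_excel(path)
        # Row-wise concatenation, as the format table describes
        return "\n".join(" ".join(str(v) for v in row)
                         for row in df.itertuples(index=False))
    # Plain-text formats (.txt, .md, .json, ...) are read directly
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read()
```

The heavy imports are deferred into each branch so a missing optional dependency only fails for the format that needs it.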
Each file is wrapped with metadata markers:
```
<!-- File: document.pdf -->
[extracted text content]
<-- End -->
```
Sources: g4f/tools/files.py149-213
For comprehensive document conversion, MarkItDown is used as a universal converter during upload:
This handles formats like PPTX, images with OCR, audio transcription, and more.
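A minimal sketch of that conversion step using MarkItDown's Python API (the wrapper function name is ours; the actual call site is in upload_files()):

```python
def to_markdown(path: str) -> str:
    """Convert a document to markdown text via MarkItDown.
    Hypothetical wrapper around the library call used during upload."""
    from markitdown import MarkItDown  # lazy import: optional dependency
    result = MarkItDown().convert(path)
    return result.text_content
```

The resulting text is what gets written next to the original upload as `<name>.md` (e.g. `document.pdf.md`).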
Sources: g4f/gui/server/backend_api.py429-446
The bucket tool is invoked during message processing to inject file content. It operates on the bucket_tool function name defined in TOOL_NAMES:
Step 1: Detect Bucket References
Messages are scanned for the pattern {"bucket_id": "xxx"}:
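A sketch of that scan (the regex is ours, built from the documented notation; the real pattern lives in run_tools.py):

```python
import re

# Matches the documented reference notation {"bucket_id": "xxx"}
BUCKET_PATTERN = re.compile(r'\{"bucket_id":\s*"([^"]+)"\}')

def find_bucket_ids(message: str) -> list:
    """Return every bucket id referenced in a message."""
    return BUCKET_PATTERN.findall(message)
```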
Step 2: Replace with Content
The read_bucket() function retrieves cached text:
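An illustrative version of that read, assuming the cache layout documented above (spaCy chunks prioritized over plain chunks, with the combined plain.cache as fallback; exact precedence rules are in files.py):

```python
import os

def _read(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

def read_bucket_sketch(bucket_dir: str) -> str:
    """Concatenate cached chunks: spacy_NNNN.cache wins over
    plain_NNNN.cache; fall back to the combined plain.cache."""
    parts, i = [], 1
    while True:
        spacy_f = os.path.join(bucket_dir, f"spacy_{i:04d}.cache")
        plain_f = os.path.join(bucket_dir, f"plain_{i:04d}.cache")
        if os.path.exists(spacy_f):
            parts.append(_read(spacy_f))
        elif os.path.exists(plain_f):
            parts.append(_read(plain_f))
        else:
            break
        i += 1
    if not parts:  # no chunked caches: use the combined cache
        whole = os.path.join(bucket_dir, "plain.cache")
        if os.path.exists(whole):
            parts.append(_read(whole))
    return "".join(parts)
```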
Step 3: Add Citation Instructions
If the message contains "Source:" markers (from URL downloads or file metadata), citation instructions are appended:
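That conditional append can be sketched as follows (the instruction text here is a placeholder; the real wording is BUCKET_INSTRUCTIONS in run_tools.py):

```python
# Placeholder instruction text - an assumption, not the real constant
CITATION_NOTE = "Cite the sources you used in your answer."

def maybe_add_citations(message: str) -> str:
    """Append citation instructions only when the injected content
    carries 'Source:' markers, mirroring Step 3."""
    if "Source:" in message and CITATION_NOTE not in message:
        return message + "\n\n" + CITATION_NOTE
    return message
```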
Sources: g4f/tools/run_tools.py32-110
For example, a user message containing the reference notation {"bucket_id": "..."} is rewritten so that the notation is replaced with the bucket's cached text before the message reaches the model.
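A minimal sketch of this substitution (the helper name and exact replacement strategy are assumptions; the real logic is ToolHandler.process_bucket_tool()):

```python
def inject_bucket(message: str, bucket_text: str, bucket_id: str) -> str:
    """Replace the documented JSON reference with the cached bucket text."""
    needle = '{"bucket_id": "%s"}' % bucket_id
    return message.replace(needle, bucket_text)

before = 'Summarize this: {"bucket_id": "abc123"}'
after = inject_bucket(before, "<!-- File: notes.txt -->\nhello\n<-- End -->", "abc123")
```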
Sources: g4f/tools/run_tools.py88-110
Route: GET /backend-api/v2/files/<bucket_id>/stream
Returns Server-Sent Events (SSE) with processing status and content chunks.
Event types:
- download: URL download progress
- load: File loading and caching progress
- refine: spaCy refinement progress (if enabled)
- media: Media file notification
- delete_files: Cleanup completion
- done: Final size report
- error: Error message

Query parameters:

- delete_files (default: true): Delete source files after processing
- refine_chunks_with_spacy (default: false): Apply NLP refinement

Sources: g4f/gui/server/backend_api.py386-411
Diagram: Streaming Data Flow
Core streaming function:
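A hedged sketch of the orchestration get_streaming() performs: yield JSON status events while loading and caching bucket content. Event names come from the documented SSE protocol; the real function is asynchronous and covers downloads, refinement, and errors:

```python
import json
import os

def stream_bucket_events(bucket_dir: str, refine: bool = False):
    """Illustrative event generator for bucket processing status.
    Simplified stand-in for the real streaming orchestrator."""
    size = 0
    cache = os.path.join(bucket_dir, "plain.cache")
    if os.path.exists(cache):
        size = os.path.getsize(cache)
        yield json.dumps({"action": "load", "size": size})
    if refine:
        yield json.dumps({"action": "refine", "size": size})
    yield json.dumps({"action": "done", "size": size})
```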
Sources: g4f/tools/files.py559-568
The cache_stream() function implements write-through caching:
- If plain.cache exists, stream directly from the cache
- Otherwise, process the files and write plain.cache after completion

Sources: g4f/tools/files.py215-226
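The write-through pattern can be sketched like this (the temp-file-and-rename step is our assumption for crash safety; the real cache_stream() may differ in detail):

```python
import os
from typing import Iterator

def cache_stream_sketch(chunks: Iterator[str], cache_file: str) -> Iterator[str]:
    """Replay from cache if present; otherwise tee chunks to a temp file
    and rename it on completion so a partial run never leaves a cache."""
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as f:
            yield f.read()
        return
    tmp = cache_file + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(chunk)
            yield chunk  # pass each chunk through while caching it
    os.rename(tmp, cache_file)
```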
The bucket system can download external URLs and add them to the file collection.
Create a downloads.json file in the bucket directory:
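A plausible downloads.json (the array-of-objects shape and the urls/max_depth keys come from the key-files list above; the URL is illustrative):

```json
[
  {
    "urls": ["https://example.com/docs/intro.html"],
    "max_depth": 1
  }
]
```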
Parameters:
- urls or url: List of URLs or single URL to download
- max_depth (default: 0): Recursive link following depth
- delay (default: 3): Seconds between requests
- group_size (default: 5): Parallel download batch size
- timeout (default: 10): Request timeout in seconds

Sources: g4f/tools/files.py490-502
Filename generation:
Sources: g4f/tools/files.py413-488 g4f/tools/files.py321-326
For HTML files, links are extracted for recursive downloading:
Sources: g4f/tools/files.py390-411
When plain.cache exceeds ~1MB, it's automatically split into chunks:
Completeness check (ensures code blocks are not split):
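The idea can be sketched as splitting on newline boundaries while holding a chunk open whenever it contains an unclosed ``` fence (the even-fence-count rule is our stand-in for the real completeness check):

```python
def split_by_size_and_newline(text: str, max_size: int = 1 << 20):
    """Split text into ~max_size chunks on newlines, never splitting
    inside a ``` code block. Illustrative version of
    split_file_by_size_and_newline()."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        current += line
        # Only cut when the chunk is big enough AND all fences are closed
        if len(current) >= max_size and current.count("```") % 2 == 0:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks
```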
Sources: g4f/tools/files.py286-319 g4f/tools/files.py228-229
Optional NLP-based text refinement for better context quality:
Usage: Set refine_chunks_with_spacy=true in the stream request.
The refined text is cached separately in spacy_NNNN.cache files, which are prioritized over plain cache during reads.
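A sketch of the refinement idea: segment into sentences and drop fragments with no alphabetic content. The filtering criterion here is our assumption; the actual refinement logic lives in g4f/tools/files.py, and a naive splitter stands in when the spaCy model is unavailable:

```python
def refine_text(text: str) -> str:
    """Sentence-level filtering sketch of the spaCy refinement step."""
    try:
        import spacy
        nlp = spacy.load("en_core_web_sm")
        sentences = [s.text for s in nlp(text).sents]
    except Exception:  # spacy or en_core_web_sm not installed
        sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    # Drop fragments without alphabetic content (page numbers, rules, ...)
    kept = [s for s in sentences
            if any(t.isalpha() for t in s.replace(".", " ").split())]
    return " ".join(kept)
```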
Sources: g4f/tools/files.py124-140 g4f/tools/files.py262-284
Media files (images, videos) are stored separately from documents:
- Validation: is_allowed_extension(filename)
- Storage: bucket_dir/media/<filename>
- Thumbnails: bucket_dir/thumbnail/<filename>

Route: GET /files/<bucket_id>/<file_type>/<filename>
file_type options:
- media: Full-size media
- thumbnail: Generated thumbnail (with fallback to media)

Query parameters:

- url=<source_url>: Original source URL (encoded)

Sources: g4f/gui/server/backend_api.py487-504 g4f/gui/server/backend_api.py413-485
Endpoint: POST /backend-api/v2/files/<bucket_id>
Request: multipart/form-data with files[] field
Response:
Sources: g4f/gui/server/backend_api.py413-485
Endpoint: GET /backend-api/v2/files/<bucket_id>/stream
Query parameters:
- delete_files: Boolean (default: true)
- refine_chunks_with_spacy: Boolean (default: false)

Response: text/event-stream
```
data: {"action": "download", "count": 1}
data: {"action": "load", "size": 4096}
data: {"action": "media", "filename": "image.jpg"}
data: {"action": "done", "size": 1048576}
```
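A client consuming this stream only needs to strip the SSE `data: ` prefix and parse the JSON payloads; a minimal sketch:

```python
import json

def parse_sse(lines):
    """Parse 'data: {...}' SSE lines into event dicts (client-side
    sketch for the /files/<bucket_id>/stream endpoint)."""
    events = []
    for line in lines:
        if line.startswith("data: "):
            events.append(json.loads(line[len("data: "):]))
    return events
```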
Sources: g4f/gui/server/backend_api.py386-411
Endpoints:
- GET /backend-api/v2/files/<bucket_id>: Stream content (non-SSE)
- DELETE /backend-api/v2/files/<bucket_id>: Delete entire bucket

DELETE response:
Sources: g4f/gui/server/backend_api.py390-405
The bucket tool is invoked via the standard tool calling mechanism:
When bucket_tool is in the tool calls list, ToolHandler.process_bucket_tool() is invoked automatically during message processing in async_iter_run_tools() and iter_run_tools().
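A hedged illustration of the tool-call shape that triggers this path (the field names follow the common OpenAI-style tool-calling schema and are assumptions; only the `bucket_tool` name is confirmed by TOOL_NAMES):

```python
# Hypothetical tool_calls entry passed through run_tools - the exact
# schema is defined by the tool-calling layer, not by this page.
tool_calls = [
    {
        "type": "function",
        "function": {
            "name": "bucket_tool",  # matches TOOL_NAMES
            "arguments": {},
        },
    }
]
```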
Sources: g4f/tools/run_tools.py113-145
Diagram: Complete Bucket Integration Flow
Sources: g4f/tools/run_tools.py236-346 g4f/tools/run_tools.py88-110 g4f/gui/server/api.py178-311
| Entity | Location | Purpose |
|---|---|---|
| get_bucket_dir() | g4f/files.py11-31 | Resolve bucket directory path |
| supports_filename() | g4f/tools/files.py89-122 | Check if file format is supported |
| stream_read_files() | g4f/tools/files.py149-213 | Extract text from various file formats |
| read_bucket() | g4f/tools/files.py246-260 | Read cached bucket content |
| cache_stream() | g4f/tools/files.py215-226 | Write-through cache implementation |
| split_file_by_size_and_newline() | g4f/tools/files.py286-319 | Split large files into chunks |
| download_urls() | g4f/tools/files.py413-488 | Async URL downloading with recursion |
| get_streaming() | g4f/tools/files.py559-568 | Main streaming orchestrator |
| ToolHandler.process_bucket_tool() | g4f/tools/run_tools.py88-110 | Inject bucket content into messages |
| BUCKET_INSTRUCTIONS | g4f/tools/run_tools.py32-34 | Citation format instructions |
| upload_files() | g4f/gui/server/backend_api.py413-485 | File upload endpoint handler |
| manage_files() | g4f/gui/server/backend_api.py390-411 | Streaming/deletion endpoint handler |
Sources: All files referenced in table
Required:
- aiofile: Async file I/O (optional, for concurrent writes)
- aiohttp: HTTP client for URL downloads

Document Processing (at least one per format):

- PyPDF2 or pdfplumber or pdfminer: PDF extraction
- python-docx or docx2txt: DOCX extraction
- odfpy: ODT support
- ebooklib: EPUB support
- pandas + openpyxl: XLSX support
- BeautifulSoup4: HTML scraping

Optional:

- spacy + en_core_web_sm model: Text refinement
- markitdown: Universal document conversion
- Pillow: Image processing and thumbnails

Install all file processing dependencies:
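A plausible one-shot install (hedged: the `files` extra name is an assumption; check the project's packaging metadata for the exact extras):

```shell
pip install -U g4f[files]

# Optional spaCy model for text refinement
python -m spacy download en_core_web_sm
```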
Sources: g4f/tools/files.py17-74
- Processed text is cached in plain.cache to avoid reprocessing on subsequent reads

Sources: g4f/tools/files.py215-319 g4f/tools/files.py413-488