This page provides an overview of the nanochat system and guides you through the initial setup and execution of a complete training run. It covers the prerequisites, system architecture, and the end-to-end training pipeline from tokenization to deployment.
For detailed installation instructions, see Installation and Setup. For a step-by-step walkthrough of executing the speedrun, see Quick Start: Running the Speedrun. For an in-depth explanation of each pipeline stage, see Training Pipeline Overview.
The default configuration is optimized for 8xH100 GPUs and completes in approximately 2.76 hours for a GPT-2-grade model. Alternative configurations:
| Configuration | Hardware | Training Time | Use Case |
|---|---|---|---|
| Production (d26) | 8xH100 | ~2.76 hours | Full GPT-2 capability |
| Development (d12) | 8xH100 | ~1.5 hours | Faster iteration |
| CPU Demo (d6) | MacBook M3 Max | ~40 minutes | Educational/testing |
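To give a rough sense of how the single depth knob in the table scales model size, here is a back-of-the-envelope sketch. The width rule (model width = 64 × depth) and the 12·L·d² block-parameter approximation are illustrative assumptions, not values taken from this page:

```python
# Rough parameter-count sketch for depth-configured models.
# Assumptions (illustrative, not from nanochat itself):
#   - model width = 64 * depth
#   - transformer block params ~ 12 * depth * width^2 (embeddings ignored)

def approx_params(depth: int, width_per_layer: int = 64) -> int:
    width = depth * width_per_layer
    return 12 * depth * width * width

for depth in (26, 12, 6):
    print(f"d{depth}: ~{approx_params(depth) / 1e6:.0f}M parameters")
```

The point of the sketch is only that parameter count grows roughly cubically with depth under this width rule, which is why d12 and d6 are so much cheaper to iterate on than d26.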
Prerequisite: `uv`, which is installed automatically by the speedrun script (`runs/speedrun.sh:22`).

Sources: `pyproject.toml:1-75`, `runs/speedrun.sh:1-98`, `runs/runcpu.sh:1-66`
The nanochat system consists of three primary layers: data preparation, training, and deployment. Each layer is implemented as a set of standalone Python modules that can be invoked independently or orchestrated via shell scripts.
Sources: `runs/speedrun.sh:1-98`, `runs/runcpu.sh:1-66`
The complete training pipeline consists of five sequential stages, each implemented as a separate Python module. The runs/speedrun.sh script orchestrates these stages and manages intermediate artifacts.
Sources: `runs/speedrun.sh:1-98`
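The five-stage, fail-fast orchestration pattern the shell script follows can be sketched in Python. The stage names below are illustrative placeholders, not nanochat's actual module paths:

```python
# Sketch of speedrun.sh's orchestration pattern: run each pipeline stage as a
# separate module, strictly in order, aborting on the first failure.
# Stage names are hypothetical placeholders.
import subprocess

STAGES = [
    "tok_train",   # tokenizer training        (hypothetical name)
    "base_train",  # base pretraining          (hypothetical name)
    "base_eval",   # CORE evaluation           (hypothetical name)
    "chat_sft",    # supervised fine-tuning    (hypothetical name)
    "chat_eval",   # ChatCORE evaluation       (hypothetical name)
]

def build_commands(stages, dry_run=True):
    """Build (and optionally execute) one command per pipeline stage."""
    cmds = [["python", "-m", f"scripts.{s}"] for s in stages]
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)  # check=True -> fail fast
    return cmds

commands = build_commands(STAGES)
```

Because each stage is its own process, a failed run can be resumed from the last completed stage by re-invoking only the remaining modules.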
The speedrun script uses environment variables to configure runtime behavior:
| Variable | Default | Purpose |
|---|---|---|
| `NANOCHAT_BASE_DIR` | `~/.cache/nanochat` | Directory for checkpoints and artifacts (`runs/speedrun.sh:15`) |
| `WANDB_RUN` | `dummy` | Weights & Biases run name; set to enable logging (`runs/speedrun.sh:37-40`) |
| `OMP_NUM_THREADS` | `1` | OpenMP thread count; prevents CPU oversubscription (`runs/speedrun.sh:14`) |
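The defaults in the table can be mirrored with plain environment lookups. This is a sketch of the lookup logic, not nanochat's actual configuration code:

```python
# Sketch: resolve the three runtime variables with the table's defaults.
import os

def runtime_config(env=os.environ):
    return {
        "base_dir": env.get("NANOCHAT_BASE_DIR",
                            os.path.expanduser("~/.cache/nanochat")),
        "wandb_run": env.get("WANDB_RUN", "dummy"),
        "omp_threads": int(env.get("OMP_NUM_THREADS", "1")),
    }

cfg = runtime_config({})  # empty environment -> all defaults apply
```

Passing the environment as a parameter keeps the resolution logic trivially testable without mutating `os.environ`.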
All intermediate and final artifacts are stored in $NANOCHAT_BASE_DIR:
```
~/.cache/nanochat/
├── data/                          # FineWeb-EDU parquet shards
├── tokenizers/                    # Trained tokenizer files
├── checkpoints/                   # Model checkpoints (base + SFT)
├── identity_conversations.jsonl   # Synthetic personality data
└── report/                        # Generated markdown reports
```
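The tree above can be resolved programmatically with `pathlib`. The directory names come from the tree; the helper itself is illustrative:

```python
# Sketch: build the artifact paths listed above from a base directory.
import os
from pathlib import Path

def artifact_paths(base_dir=None):
    base = Path(base_dir or os.environ.get(
        "NANOCHAT_BASE_DIR", Path.home() / ".cache" / "nanochat"))
    return {
        "data": base / "data",
        "tokenizers": base / "tokenizers",
        "checkpoints": base / "checkpoints",
        "identity": base / "identity_conversations.jsonl",
        "report": base / "report",
    }

paths = artifact_paths("/tmp/nanochat")
```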
Sources: `runs/speedrun.sh:14-16`, `runs/speedrun.sh:43-46`
Nanochat uses the uv package manager for fast, deterministic dependency resolution. The package manager is automatically installed by the speedrun script if not present (`runs/speedrun.sh:22`).
The speedrun script performs these steps:
1. Installs `uv` if missing: `curl -LsSf https://astral.sh/uv/install.sh | sh` (`runs/speedrun.sh:22`)
2. `uv venv` creates the `.venv/` directory (`runs/speedrun.sh:24`)
3. `uv sync --extra gpu` installs PyTorch with CUDA 12.8 (`runs/speedrun.sh:26`)
4. `source .venv/bin/activate` activates the virtual environment (`runs/speedrun.sh:28`)

The `pyproject.toml` defines two mutually exclusive extras:
| Extra | PyTorch Index | Use Case |
|---|---|---|
| `gpu` | `pytorch-cu128` | H100 GPUs with CUDA 12.8 (`pyproject.toml:47-48`) |
| `cpu` | `pytorch-cpu` | Development/testing without a GPU (`pyproject.toml:46`) |
A conflict constraint prevents installing both extras simultaneously (`pyproject.toml:69-74`).
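In uv, mutually exclusive extras are declared with a `conflicts` table under `[tool.uv]`. The fragment below is a minimal sketch of that shape; nanochat's actual `pyproject.toml` entries (dependency pins, index definitions) will differ:

```toml
# Sketch of uv's conflicting-extras declaration (illustrative, not copied
# from nanochat's pyproject.toml).
[project.optional-dependencies]
gpu = ["torch"]
cpu = ["torch"]

[tool.uv]
conflicts = [
    [{ extra = "gpu" }, { extra = "cpu" }],
]

[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true
```

With this in place, `uv sync --extra gpu --extra cpu` fails at resolution time instead of silently installing an inconsistent torch build.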
Sources: `pyproject.toml:1-75`, `runs/speedrun.sh:19-28`, `runs/runcpu.sh:16-19`
The speedrun script uses auto-configuration based on a single --depth parameter. For the production d26 model:
Parameter Explanation:
- `--depth=26`: creates a 26-layer transformer (matches the GPT-2 architecture)
- `--target-param-data-ratio=8.25`: trains on 8.25× tokens per parameter (slightly undertrained for speed)
- `--device-batch-size=16`: batch size per GPU device
- `--fp8`: enables FP8 mixed-precision training on H100 GPUs
- `--run=$WANDB_RUN`: Weights & Biases run identifier

The `--depth` parameter automatically derives the remaining model hyperparameters from this single knob.
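The `--target-param-data-ratio` flag translates directly into a token budget: total training tokens = ratio × parameter count. A worked example, using an illustrative 560M parameter count (not a figure from this page):

```python
# Worked example of --target-param-data-ratio: token budget = ratio * params.
def training_tokens(n_params: float, ratio: float = 8.25) -> float:
    return ratio * n_params

tokens = training_tokens(560e6)  # 560M params is an assumed, illustrative size
print(f"{tokens / 1e9:.2f}B training tokens")
```

For comparison, compute-optimal ("Chinchilla-style") training is usually quoted around 20 tokens per parameter, which is why the page describes 8.25 as slightly undertrained for speed.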
Sources: `runs/speedrun.sh:73`
To enable experiment tracking:
1. Run `wandb login` before starting training
2. `export WANDB_RUN=my_experiment_name`
3. `bash runs/speedrun.sh`

If `WANDB_RUN` is unset or set to `dummy`, wandb logging is disabled (`runs/speedrun.sh:37-40`).
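The enable/disable rule above reduces to a one-line check against the `dummy` sentinel. A sketch of that logic (not nanochat's actual code):

```python
# Sketch: wandb logging is active only when WANDB_RUN is set to something
# other than the "dummy" sentinel.
import os

def wandb_enabled(env=os.environ) -> bool:
    return env.get("WANDB_RUN", "dummy") != "dummy"

assert not wandb_enabled({})                              # unset -> disabled
assert not wandb_enabled({"WANDB_RUN": "dummy"})          # sentinel -> disabled
assert wandb_enabled({"WANDB_RUN": "my_experiment_name"})  # named run -> enabled
```

Using a sentinel value rather than presence/absence lets the script pass `--run=$WANDB_RUN` unconditionally and decide downstream whether to log.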
The nanochat.report module generates comprehensive markdown reports:
- `python -m nanochat.report reset` clears previous reports and writes system info (`runs/speedrun.sh:46`)
- `python -m nanochat.report generate` compiles all sections into `report.md` (`runs/speedrun.sh:97`)

Reports include results from each pipeline stage.
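The `generate` step is, in essence, a concatenation of per-stage markdown sections. The sketch below is a hypothetical illustration of that pattern; the file layout and function are not nanochat's actual implementation:

```python
# Illustrative sketch of a "generate" step: concatenate per-stage markdown
# sections into a single report.md (file layout here is hypothetical).
from pathlib import Path

def generate_report(report_dir: Path) -> str:
    # Sort so sections appear in a deterministic order; skip the output file
    # itself so repeated runs are idempotent.
    sections = sorted(p for p in report_dir.glob("*.md")
                      if p.name != "report.md")
    body = "\n\n".join(p.read_text() for p in sections)
    (report_dir / "report.md").write_text(body)
    return body
```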
Sources: `runs/speedrun.sh:43-46`, `runs/speedrun.sh:95-97`
| Stage | Duration | Output |
|---|---|---|
| Data download (8 shards) | ~2 minutes | ~/.cache/nanochat/data/ |
| Tokenizer training | ~5 minutes | ~/.cache/nanochat/tokenizers/tok32768.tiktoken |
| Data download (370 shards) | ~15 minutes (background) | ~/.cache/nanochat/data/ |
| Base pretraining | ~2.5 hours | ~/.cache/nanochat/checkpoints/baserun/ |
| Base evaluation | ~10 minutes | CORE score > 0.256525 (GPT-2 threshold) |
| Identity data download | ~10 seconds | identity_conversations.jsonl |
| SFT training | ~15 minutes | ~/.cache/nanochat/checkpoints/sft/ |
| Chat evaluation | ~5 minutes | ChatCORE scores |
| Total | ~2.76 hours | Deployable chat model |
Checkpoints are saved with three files per step:
```
checkpoints/baserun/
├── step_010000_model.pt   # Model weights
├── step_010000_optim.pt   # Optimizer state
└── step_010000_meta.pt    # Training metadata
```
The checkpoint manager automatically finds the latest checkpoint for resumption (see Checkpoint Management).
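Latest-checkpoint discovery over the `step_XXXXXX_model.pt` naming shown above can be sketched as a glob plus a max over the parsed step numbers. This is an illustration of the idea, not nanochat's actual checkpoint manager:

```python
# Sketch: find the latest checkpoint step for the step_XXXXXX_model.pt
# naming convention shown above.
import re
from pathlib import Path

STEP_RE = re.compile(r"step_(\d+)_model\.pt$")

def latest_step(ckpt_dir: Path):
    """Return the highest step number found, or None if no checkpoints exist."""
    steps = [int(m.group(1))
             for p in ckpt_dir.glob("step_*_model.pt")
             if (m := STEP_RE.search(p.name))]
    return max(steps, default=None)
```

Globbing only the `_model.pt` files avoids counting a step whose optimizer or metadata file happens to exist without its weights.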
Sources: `runs/speedrun.sh:56-76`
After the speedrun completes, verify that the model works. The model should:

- Respond using the identity data from `identity_conversations.jsonl`
- Handle `<|python_start|>` tool-use tokens

Sources: `runs/speedrun.sh:88-92`
Now that you understand the high-level system architecture and execution flow, see Quick Start: Running the Speedrun for the step-by-step walkthrough, Training Pipeline Overview for an in-depth explanation of each stage, and Checkpoint Management for details on specific subsystems.
Sources: `runs/speedrun.sh:1-98`, `runs/runcpu.sh:1-66`, `pyproject.toml:1-75`