This page provides an overview of the nanochat system and guides you through the initial setup and execution of a complete training run. It covers the prerequisites, system architecture, and the end-to-end training pipeline from tokenization to deployment.
For detailed installation instructions, see Installation and Setup. For a step-by-step walkthrough of executing the speedrun, see Quick Start: Running the Speedrun. For an in-depth explanation of each pipeline stage, see Training Pipeline Overview.
The default configuration is optimized for 8xH100 GPUs and completes in approximately 2.76 hours for a GPT-2-grade model. Alternative configurations:
| Configuration | Hardware | Training Time | Use Case |
|---|---|---|---|
| Production (d26) | 8xH100 | ~2.76 hours | Full GPT-2 capability |
| Development (d12) | 8xH100 | ~1.5 hours | Faster iteration |
| CPU Demo (d6) | MacBook M3 Max | ~40 minutes | Educational/testing |
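To give a rough sense of how the single depth knob in the table scales model size, here is a back-of-the-envelope sketch. The width rule (model width = 64 × depth) and the 12·L·d² block-parameter approximation are illustrative assumptions, not values taken from this page:

```python
# Rough parameter-count sketch for depth-configured models.
# Assumptions (illustrative, not from nanochat itself):
#   - model width = 64 * depth
#   - transformer block params ~ 12 * depth * width^2 (embeddings ignored)

def approx_params(depth: int, width_per_layer: int = 64) -> int:
    width = depth * width_per_layer
    return 12 * depth * width * width

for depth in (26, 12, 6):
    print(f"d{depth}: ~{approx_params(depth) / 1e6:.0f}M parameters")
```

The point of the sketch is only that parameter count grows roughly cubically with depth under this width rule, which is why d12 and d6 are so much cheaper to iterate on than d26.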
Prerequisite: `uv`, which is installed automatically by the speedrun script (`runs/speedrun.sh:22`).

Sources: `pyproject.toml:1-75`, `runs/speedrun.sh:1-98`, `runs/runcpu.sh:1-66`
The nanochat system consists of three primary layers: data preparation, training, and deployment. Each layer is implemented as a set of standalone Python modules that can be invoked independently or orchestrated via shell scripts.
Sources: `runs/speedrun.sh:1-98`, `runs/runcpu.sh:1-66`
The complete training pipeline consists of five sequential stages, each implemented as a separate Python module. The runs/speedrun.sh script orchestrates these stages and manages intermediate artifacts.
Sources: `runs/speedrun.sh:1-98`
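The five-stage, fail-fast orchestration pattern the shell script follows can be sketched in Python. The stage names below are illustrative placeholders, not nanochat's actual module paths:

```python
# Sketch of speedrun.sh's orchestration pattern: run each pipeline stage as a
# separate module, strictly in order, aborting on the first failure.
# Stage names are hypothetical placeholders.
import subprocess

STAGES = [
    "tok_train",   # tokenizer training        (hypothetical name)
    "base_train",  # base pretraining          (hypothetical name)
    "base_eval",   # CORE evaluation           (hypothetical name)
    "chat_sft",    # supervised fine-tuning    (hypothetical name)
    "chat_eval",   # ChatCORE evaluation       (hypothetical name)
]

def build_commands(stages, dry_run=True):
    """Build (and optionally execute) one command per pipeline stage."""
    cmds = [["python", "-m", f"scripts.{s}"] for s in stages]
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)  # check=True -> fail fast
    return cmds

commands = build_commands(STAGES)
```

Because each stage is its own process, a failed run can be resumed from the last completed stage by re-invoking only the remaining modules.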
The speedrun script uses environment variables to configure runtime behavior:
| Variable | Default | Purpose |
|---|---|---|
| `NANOCHAT_BASE_DIR` | `~/.cache/nanochat` | Directory for checkpoints and artifacts (`runs/speedrun.sh:15`) |
| `WANDB_RUN` | `dummy` | Weights & Biases run name; set to enable logging (`runs/speedrun.sh:37-40`) |
| `OMP_NUM_THREADS` | `1` | OpenMP thread count; prevents CPU oversubscription (`runs/speedrun.sh:14`) |
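The defaults in the table can be mirrored with plain environment lookups. This is a sketch of the lookup logic, not nanochat's actual configuration code:

```python
# Sketch: resolve the three runtime variables with the table's defaults.
import os

def runtime_config(env=os.environ):
    return {
        "base_dir": env.get("NANOCHAT_BASE_DIR",
                            os.path.expanduser("~/.cache/nanochat")),
        "wandb_run": env.get("WANDB_RUN", "dummy"),
        "omp_threads": int(env.get("OMP_NUM_THREADS", "1")),
    }

cfg = runtime_config({})  # empty environment -> all defaults apply
```

Passing the environment as a parameter keeps the resolution logic trivially testable without mutating `os.environ`.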
All intermediate and final artifacts are stored in $NANOCHAT_BASE_DIR:
```
~/.cache/nanochat/
├── data/                          # FineWeb-EDU parquet shards
├── tokenizers/                    # Trained tokenizer files
├── checkpoints/                   # Model checkpoints (base + SFT)
├── identity_conversations.jsonl   # Synthetic personality data
└── report/                        # Generated markdown reports
```
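The tree above can be resolved programmatically with `pathlib`. The directory names come from the tree; the helper itself is illustrative:

```python
# Sketch: build the artifact paths listed above from a base directory.
import os
from pathlib import Path

def artifact_paths(base_dir=None):
    base = Path(base_dir or os.environ.get(
        "NANOCHAT_BASE_DIR", Path.home() / ".cache" / "nanochat"))
    return {
        "data": base / "data",
        "tokenizers": base / "tokenizers",
        "checkpoints": base / "checkpoints",
        "identity": base / "identity_conversations.jsonl",
        "report": base / "report",
    }

paths = artifact_paths("/tmp/nanochat")
```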
Sources: `runs/speedrun.sh:14-16`, `runs/speedrun.sh:43-46`
Nanochat uses the uv package manager for fast, deterministic dependency resolution. The package manager is automatically installed by the speedrun script if not present (`runs/speedrun.sh:22`).
The speedrun script performs these steps:
1. Installs `uv` if missing: `curl -LsSf https://astral.sh/uv/install.sh | sh` (`runs/speedrun.sh:22`)
2. `uv venv` creates the `.venv/` directory (`runs/speedrun.sh:24`)
3. `uv sync --extra gpu` installs PyTorch with CUDA 12.8 (`runs/speedrun.sh:26`)
4. `source .venv/bin/activate` activates the virtual environment (`runs/speedrun.sh:28`)

The `pyproject.toml` defines two mutually exclusive extras:
| Extra | PyTorch Index | Use Case |
|---|---|---|
| `gpu` | `pytorch-cu128` | H100 GPUs with CUDA 12.8 (`pyproject.toml:47-48`) |
| `cpu` | `pytorch-cpu` | Development/testing without a GPU (`pyproject.toml:46`) |
A conflict constraint prevents installing both extras simultaneously (`pyproject.toml:69-74`).
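In uv, mutually exclusive extras are declared with a `conflicts` table under `[tool.uv]`. The fragment below is a minimal sketch of that shape; nanochat's actual `pyproject.toml` entries (dependency pins, index definitions) will differ:

```toml
# Sketch of uv's conflicting-extras declaration (illustrative, not copied
# from nanochat's pyproject.toml).
[project.optional-dependencies]
gpu = ["torch"]
cpu = ["torch"]

[tool.uv]
conflicts = [
    [{ extra = "gpu" }, { extra = "cpu" }],
]

[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true
```

With this in place, `uv sync --extra gpu --extra cpu` fails at resolution time instead of silently installing an inconsistent torch build.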
Sources: `pyproject.toml:1-75`, `runs/speedrun.sh:19-28`, `runs/runcpu.sh:16-19`
The speedrun script uses auto-configuration based on a single --depth parameter. For the production d26 model:
Parameter Explanation:
- `--depth=26`: creates a 26-layer transformer (matches the GPT-2 architecture)
- `--target-param-data-ratio=8.25`: trains on 8.25× tokens per parameter (slightly undertrained for speed)
- `--device-batch-size=16`: batch size per GPU device
- `--fp8`: enables FP8 mixed-precision training on H100 GPUs
- `--run=$WANDB_RUN`: Weights & Biases run identifier

The `--depth` parameter automatically derives the remaining model hyperparameters from this single knob.
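The `--target-param-data-ratio` flag translates directly into a token budget: total training tokens = ratio × parameter count. A worked example, using an illustrative 560M parameter count (not a figure from this page):

```python
# Worked example of --target-param-data-ratio: token budget = ratio * params.
def training_tokens(n_params: float, ratio: float = 8.25) -> float:
    return ratio * n_params

tokens = training_tokens(560e6)  # 560M params is an assumed, illustrative size
print(f"{tokens / 1e9:.2f}B training tokens")
```

For comparison, compute-optimal ("Chinchilla-style") training is usually quoted around 20 tokens per parameter, which is why the page describes 8.25 as slightly undertrained for speed.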
Sources: `runs/speedrun.sh:73`
To enable experiment tracking:
1. Run `wandb login` before starting training
2. `export WANDB_RUN=my_experiment_name`
3. `bash runs/speedrun.sh`

If `WANDB_RUN` is unset or set to `dummy`, wandb logging is disabled (`runs/speedrun.sh:37-40`).
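The enable/disable rule above reduces to a one-line check against the `dummy` sentinel. A sketch of that logic (not nanochat's actual code):

```python
# Sketch: wandb logging is active only when WANDB_RUN is set to something
# other than the "dummy" sentinel.
import os

def wandb_enabled(env=os.environ) -> bool:
    return env.get("WANDB_RUN", "dummy") != "dummy"

assert not wandb_enabled({})                              # unset -> disabled
assert not wandb_enabled({"WANDB_RUN": "dummy"})          # sentinel -> disabled
assert wandb_enabled({"WANDB_RUN": "my_experiment_name"})  # named run -> enabled
```

Using a sentinel value rather than presence/absence lets the script pass `--run=$WANDB_RUN` unconditionally and decide downstream whether to log.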
The nanochat.report module generates comprehensive markdown reports:
- `python -m nanochat.report reset` clears previous reports and writes system info (`runs/speedrun.sh:46`)
- `python -m nanochat.report generate` compiles all sections into `report.md` (`runs/speedrun.sh:97`)

Reports include results from each pipeline stage.
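The `generate` step is, in essence, a concatenation of per-stage markdown sections. The sketch below is a hypothetical illustration of that pattern; the file layout and function are not nanochat's actual implementation:

```python
# Illustrative sketch of a "generate" step: concatenate per-stage markdown
# sections into a single report.md (file layout here is hypothetical).
from pathlib import Path

def generate_report(report_dir: Path) -> str:
    # Sort so sections appear in a deterministic order; skip the output file
    # itself so repeated runs are idempotent.
    sections = sorted(p for p in report_dir.glob("*.md")
                      if p.name != "report.md")
    body = "\n\n".join(p.read_text() for p in sections)
    (report_dir / "report.md").write_text(body)
    return body
```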
Sources: `runs/speedrun.sh:43-46`, `runs/speedrun.sh:95-97`
| Stage | Duration | Output |
|---|---|---|
| Data download (8 shards) | ~2 minutes | ~/.cache/nanochat/data/ |
| Tokenizer training | ~5 minutes | ~/.cache/nanochat/tokenizers/tok32768.tiktoken |
| Data download (370 shards) | ~15 minutes (background) | ~/.cache/nanochat/data/ |
| Base pretraining | ~2.5 hours | ~/.cache/nanochat/checkpoints/baserun/ |
| Base evaluation | ~10 minutes | CORE score > 0.256525 (GPT-2 threshold) |
| Identity data download | ~10 seconds | identity_conversations.jsonl |
| SFT training | ~15 minutes | ~/.cache/nanochat/checkpoints/sft/ |
| Chat evaluation | ~5 minutes | ChatCORE scores |
| Total | ~2.76 hours | Deployable chat model |
Checkpoints are saved with three files per step:
```
checkpoints/baserun/
├── step_010000_model.pt   # Model weights
├── step_010000_optim.pt   # Optimizer state
└── step_010000_meta.pt    # Training metadata
```
The checkpoint manager automatically finds the latest checkpoint for resumption (see Checkpoint Management).
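Latest-checkpoint discovery over the `step_XXXXXX_model.pt` naming shown above can be sketched as a glob plus a max over the parsed step numbers. This is an illustration of the idea, not nanochat's actual checkpoint manager:

```python
# Sketch: find the latest checkpoint step for the step_XXXXXX_model.pt
# naming convention shown above.
import re
from pathlib import Path

STEP_RE = re.compile(r"step_(\d+)_model\.pt$")

def latest_step(ckpt_dir: Path):
    """Return the highest step number found, or None if no checkpoints exist."""
    steps = [int(m.group(1))
             for p in ckpt_dir.glob("step_*_model.pt")
             if (m := STEP_RE.search(p.name))]
    return max(steps, default=None)
```

Globbing only the `_model.pt` files avoids counting a step whose optimizer or metadata file happens to exist without its weights.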
Sources: `runs/speedrun.sh:56-76`
After the speedrun completes, verify that the model works. The model should:

- Respond using the identity data from `identity_conversations.jsonl`
- Handle `<|python_start|>` tool-use tokens

Sources: `runs/speedrun.sh:88-92`
Now that you understand the high-level system architecture and execution flow, see Quick Start: Running the Speedrun for the step-by-step walkthrough, Training Pipeline Overview for an in-depth explanation of each stage, and Checkpoint Management for details on specific subsystems.
Sources: `runs/speedrun.sh:1-98`, `runs/runcpu.sh:1-66`, `pyproject.toml:1-75`