This document covers the fundamentals of executing Large Language Models (LLMs) for inference, which is the first step in building LLM-powered applications. It explains how to access models through APIs or run them locally, apply prompt engineering techniques to improve outputs, and generate structured responses for downstream applications.
This page focuses on model execution and interaction. For information about deploying LLMs at scale or creating production services, see Deployment. For building applications that augment LLMs with external data, see Retrieval Augmented Generation and Agents. For optimizing model file sizes and memory usage, see Quantization.
Running LLMs requires choosing between API-based access and local execution. This decision depends on factors including hardware availability, privacy requirements, cost constraints, and latency tolerance.
Sources: README.md 311-324
| Method | Privacy | Cost | Latency | Hardware Required | Model Selection |
|---|---|---|---|---|---|
| Private APIs | Low | Per-token fees | Low | None | Provider's models only |
| Open-source APIs | Medium | Per-token fees or credits | Low-Medium | None | Wide selection |
| Local GUI | High | Hardware cost | High (first token) | GPU/CPU | Any compatible model |
| Local CLI | High | Hardware cost | Medium-High | GPU/CPU | Any compatible model |
| HF Spaces | Medium | Free tier available | Medium | None (ZeroGPU) | Limited selection |
Sources: README.md 311-324
API-based execution provides immediate access to powerful models without local hardware requirements. The ecosystem divides into proprietary services and open-source model providers.
Private APIs offer frontier models with strong performance but limited transparency.
These services typically charge per token (input and output separately) and require API keys for authentication.
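As a sketch of how per-token billing adds up, the helper below estimates the cost of a single request. The per-1K-token prices are placeholders for illustration, not any provider's real rates:

```python
# Sketch: estimate the per-request cost of a token-priced API.
# The prices below are hypothetical, not real provider rates.
PRICES_PER_1K = {
    "example-model": {"input": 0.003, "output": 0.015},  # USD per 1K tokens
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost, billing input and output separately."""
    p = PRICES_PER_1K[model]
    return (input_tokens / 1000) * p["input"] + (output_tokens / 1000) * p["output"]

print(estimate_cost("example-model", input_tokens=2000, output_tokens=500))  # 0.0135
```

Because input and output are priced separately, trimming few-shot examples or capping the maximum output length both reduce cost directly.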
Open-source APIs provide access to open-weight models such as Llama and Mistral.
Sources: README.md 315-316
Local execution provides privacy, control, and zero marginal cost after initial hardware investment. Multiple tools support different use cases and technical expertise levels.
Sources: README.md 317, README.md 39
- LM Studio: desktop GUI for downloading and chatting with local models
- Ollama: lightweight CLI tool; uses `ollama run <model>` syntax for execution
- llama.cpp: C/C++ inference engine that runs GGUF models on CPU and GPU
- text-generation-webui (oobabooga): feature-rich Gradio web interface
- kobold.cpp: llama.cpp-based engine with a UI oriented toward storytelling
Sources: README.md 317, README.md 415
The ZeroSpace notebook automates creation of Gradio-based chat interfaces using Hugging Face's ZeroGPU infrastructure, which provides free GPU access for demos.
Sources: README.md 39
Prompt engineering involves crafting inputs to LLMs to elicit desired outputs. Techniques range from simple templates to complex multi-step reasoning patterns.
Sources: README.md 318
Zero-shot prompting gives the model a direct instruction without examples:
```
Translate the following English text to French:
"The quick brown fox jumps over the lazy dog."
```
Effective for simple tasks with well-understood instructions. Works better with larger models (70B+ parameters).
Few-shot prompting provides examples to demonstrate the desired behavior:
```
Translate the following English text to French:

English: Hello, how are you?
French: Bonjour, comment allez-vous?

English: I love programming.
French: J'aime programmer.

English: The quick brown fox jumps over the lazy dog.
French:
```
Significantly improves performance on specialized or ambiguous tasks, though the examples consume context-window tokens.
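Few-shot prompts like the one above can be assembled programmatically. A minimal sketch (the `build_few_shot_prompt` helper and its labels are illustrative, not from the course):

```python
# Sketch: assemble a few-shot prompt from (input, output) example pairs.
def build_few_shot_prompt(instruction, examples, query,
                          input_label="English", output_label="French"):
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"{input_label}: {source}")
        lines.append(f"{output_label}: {target}")
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")  # the model completes from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate the following English text to French:",
    [("Hello, how are you?", "Bonjour, comment allez-vous?"),
     ("I love programming.", "J'aime programmer.")],
    "The quick brown fox jumps over the lazy dog.",
)
print(prompt)
```

Ending the prompt with a bare output label is the key trick: the model's most natural continuation is the answer itself.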
Chain-of-thought (CoT) prompting instructs the model to show its reasoning steps:
```
Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?

Let's think step by step:
1. Roger starts with 5 tennis balls
2. He buys 2 cans of tennis balls
3. Each can has 3 tennis balls, so 2 cans have 2 × 3 = 6 tennis balls
4. Total tennis balls = 5 + 6 = 11

Answer: 11 tennis balls
```
Improves accuracy on mathematical and logical reasoning tasks. Appending "Let's think step by step" to a prompt is often enough to trigger CoT behavior.
ReAct (Reason + Act) prompting combines reasoning with actions:
```
Question: What is the population of the capital of France?

Thought: I need to find the capital of France first.
Action: search["capital of France"]
Observation: The capital of France is Paris.

Thought: Now I need to find the population of Paris.
Action: search["population of Paris"]
Observation: Paris has a population of approximately 2.2 million.

Thought: I have found the answer.
Answer: Approximately 2.2 million people.
```
Enables LLMs to interact with external tools and APIs. See Agents for advanced implementations.
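A minimal version of this loop can be sketched in Python. The `search` tool, its lookup table, and the scripted model below are stand-ins for a real LLM and a real search API:

```python
import re

# Sketch of a ReAct loop; the "search" tool is a hypothetical stand-in
# backed by a small lookup table.
TOOLS = {
    "search": lambda q: {
        "capital of France": "The capital of France is Paris.",
        "population of Paris": "Paris has a population of approximately 2.2 million.",
    }.get(q, "No result."),
}

def react_loop(call_model, question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_model(transcript)  # model emits Thought/Action or a final Answer
        transcript += step + "\n"
        answer = re.search(r"Answer: (.*)", step)
        if answer:
            return answer.group(1)
        action = re.search(r'Action: (\w+)\["(.*)"\]', step)
        if action:
            tool, arg = action.groups()
            # Feed the tool result back so the model can reason over it.
            transcript += f"Observation: {TOOLS[tool](arg)}\n"
    return None

# Scripted stand-in for a model, for demonstration only.
steps = iter([
    'Thought: I need the capital of France first.\nAction: search["capital of France"]',
    'Thought: Now I need the population of Paris.\nAction: search["population of Paris"]',
    "Answer: Approximately 2.2 million people.",
])
print(react_loop(lambda _: next(steps), "What is the population of the capital of France?"))
```

Real implementations add tool registries, error handling, and a token budget, but the loop structure (generate, parse an action, run the tool, append the observation) stays the same.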
The course includes a dedicated notebook on decoding strategies, which control how LLMs generate text.
Sources: README.md 71, README.md 179
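As an illustration of one such strategy, here is a minimal nucleus (top-p) sampler over a toy next-token distribution. The vocabulary and probabilities are made up for the example:

```python
import random

# Sketch of nucleus (top-p) sampling: sample only from the smallest set of
# tokens whose cumulative probability mass reaches p.
def top_p_sample(probs, p=0.9, rng=random):
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        total += prob
        if total >= p:
            break  # everything outside the nucleus is excluded
    tokens, weights = zip(*nucleus)
    return rng.choices(tokens, weights=weights)[0]

probs = {"the": 0.5, "a": 0.3, "dog": 0.15, "zebra": 0.05}
print(top_p_sample(probs, p=0.8))  # "dog" and "zebra" can never be chosen here
```

Unlike top-k, the nucleus size adapts to the distribution: a confident model keeps few candidates, an uncertain one keeps many.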
Many applications require LLMs to generate outputs conforming to specific formats (JSON, XML, templates). Structured output techniques ensure valid formatting and enable reliable parsing.
Sources: README.md 319, README.md 323-324
Many APIs (OpenAI, Anthropic) support a JSON mode natively.
The model then generates syntactically valid JSON, but the desired schema must still be described in the prompt.
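A common complementary pattern is to validate the reply and retry on failure. This sketch assumes a `call_model` function standing in for any chat-completion call:

```python
import json

# Sketch: request JSON, validate with json.loads, and retry on failure.
# `call_model` is a hypothetical stand-in for any chat-completion call.
def generate_json(call_model, prompt, retries=3):
    for _ in range(retries):
        raw = call_model(prompt + "\nRespond with valid JSON only.")
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # real code might feed the parse error back to the model
    raise ValueError("model never produced valid JSON")

# Demonstration with a scripted model that fails once, then succeeds.
replies = iter(["not json", '{"name": "Ada", "age": 36}'])
print(generate_json(lambda _: next(replies), "Extract the person."))
```

Validate-and-retry works with any provider but wastes tokens on failed attempts, which is exactly the inefficiency that constrained-generation libraries avoid.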
Outlines provides guided generation that ensures outputs match specified formats.
Outlines works by constraining the token sampling process, making invalid tokens impossible to select.
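The idea can be illustrated without the library: at each step, filter the vocabulary down to tokens that keep the output a valid prefix of an allowed string. This toy sketch (not Outlines' actual API; the vocabulary and stand-in "model" are contrived) shows the principle:

```python
# Toy illustration of guided generation: only tokens that keep the output
# a valid prefix of some allowed string may be selected at each step.
def constrained_generate(rank_tokens, vocab, allowed, max_len=20):
    out = ""
    while len(out) < max_len:
        valid = [t for t in vocab
                 if any(a.startswith(out + t) for a in allowed)]
        if not valid:
            break
        out += rank_tokens(out, valid)[0]  # take the model's top valid token
        if out in allowed:
            return out
    return out

vocab = ["true", "false", "yes", "tr", "ue"]
# Stand-in "model" that always prefers the longest candidate token.
pick_longest = lambda ctx, valid: sorted(valid, key=len, reverse=True)
print(constrained_generate(pick_longest, vocab, allowed={"true", "false"}))  # false
```

Real libraries compile the target format (a regex, JSON schema, or grammar) into a token-level mask applied to the logits, so invalid continuations get zero probability rather than being filtered after the fact.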
LMQL provides a programming language for LLM interactions with built-in constraints:
```
# Example structure (not actual code from repo)
"Extract information about {subject}: [CONTEXT]"
"Name: [NAME]" where len(TOKENS(NAME)) < 10
"Age: [AGE]" where INT(AGE)
```
Combines natural language prompts with programmatic constraints and control flow.
Sources: README.md 323-324
The course includes practical notebooks demonstrating LLM execution concepts:
| Notebook | Purpose | Related Concept |
|---|---|---|
| ZeroSpace | Create Gradio chat interface with ZeroGPU | Local/demo execution |
| Decoding Strategies | Text generation from beam search to nucleus sampling | Sampling methods |
| Improve ChatGPT with Knowledge Graphs | Augment responses with structured knowledge | Structured outputs |
The ZeroSpace notebook automates the deployment of chat interfaces.
Sources: README.md 39
The Decoding Strategies notebook provides implementations and visualizations of sampling methods, from greedy and beam search to nucleus sampling.
Sources: README.md 71
| Model Size | RAM (CPU) | VRAM (GPU) | Quantization Format |
|---|---|---|---|
| 7B params | 16GB | 8GB | GGUF Q4 |
| 13B params | 32GB | 16GB | GGUF Q4 |
| 34B params | 64GB | 24GB | GGUF Q4 |
| 70B params | 128GB | 48GB | GGUF Q4 |
Quantization (see Quantization) reduces requirements significantly. A 70B model quantized to 4-bit can run in ~40GB VRAM.
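These figures follow a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter, with KV cache and activation overhead added on top. A sketch (decimal GB, overhead excluded):

```python
# Rough sketch: memory needed just for model weights at a given precision.
# Real usage adds KV cache and activation overhead on top of this.
def weight_memory_gb(n_params_billions: float, bits_per_param: int) -> float:
    bytes_total = n_params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

print(weight_memory_gb(70, 16))  # fp16: 140.0 GB
print(weight_memory_gb(70, 4))   # 4-bit: 35.0 GB before overhead
```

The jump from 35 GB of weights to roughly 40 GB in practice is the cache and activation overhead, which grows with context length and batch size.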
LLMs have fixed context windows (token limits).
Context includes prompt, few-shot examples, chat history, and generated output. Monitor token usage to avoid truncation.
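A rough budget check can be sketched as follows; the tokens-per-word factor is a crude approximation, and real applications should count with the model's actual tokenizer:

```python
# Sketch: track a prompt against a fixed context window using a crude
# tokens ≈ words × 1.3 heuristic (use the model's real tokenizer in practice).
def approx_tokens(text: str) -> int:
    return int(len(text.split()) * 1.3)

def fits_context(system, examples, history, query, window=4096, reserve=512):
    """Check the prompt fits, leaving `reserve` tokens free for the output."""
    used = sum(approx_tokens(part) for part in [system, *examples, *history, query])
    return used + reserve <= window

print(fits_context("You are helpful.", [], [], "Summarize this.", window=4096))  # True
```

When the check fails, typical remedies are dropping the oldest chat turns, trimming few-shot examples, or summarizing the history before it is re-sent.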
Sources: README.md 311-324
This page covers basic LLM execution. For more advanced topics, see Deployment, Retrieval Augmented Generation, Agents, and Quantization.
Sources: README.md 305-441