This document covers the fundamentals of executing Large Language Models (LLMs) for inference, which is the first step in building LLM-powered applications. It explains how to access models through APIs or run them locally, apply prompt engineering techniques to improve outputs, and generate structured responses for downstream applications.
This page focuses on model execution and interaction. For information about deploying LLMs at scale or creating production services, see Deployment. For building applications that augment LLMs with external data, see Retrieval Augmented Generation and Agents. For optimizing model file sizes and memory usage, see Quantization.
Running LLMs requires choosing between API-based access and local execution. This decision depends on factors including hardware availability, privacy requirements, cost constraints, and latency tolerance.
Sources: README.md 311-324
| Method | Privacy | Cost | Latency | Hardware Required | Model Selection |
|---|---|---|---|---|---|
| Private APIs | Low | Per-token fees | Low | None | Provider's models only |
| Open-source APIs | Medium | Per-token fees or credits | Low-Medium | None | Wide selection |
| Local GUI | High | Hardware cost | High (first token) | GPU/CPU | Any compatible model |
| Local CLI | High | Hardware cost | Medium-High | GPU/CPU | Any compatible model |
| HF Spaces | Medium | Free tier available | Medium | None (ZeroGPU) | Limited selection |
Sources: README.md 311-324
API-based execution provides immediate access to powerful models without local hardware requirements. The ecosystem divides into proprietary services and open-source model providers.
Private APIs offer frontier models with strong performance but limited transparency.
These services typically charge per token (input and output separately) and require API keys for authentication.
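As a sketch of how per-token billing adds up, the helper below estimates the cost of a single request. The per-1K-token prices are placeholders for illustration, not any provider's real rates:

```python
# Sketch: estimate the per-request cost of a token-priced API.
# The prices below are hypothetical, not real provider rates.
PRICES_PER_1K = {
    "example-model": {"input": 0.003, "output": 0.015},  # USD per 1K tokens
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost, billing input and output separately."""
    p = PRICES_PER_1K[model]
    return (input_tokens / 1000) * p["input"] + (output_tokens / 1000) * p["output"]

print(estimate_cost("example-model", input_tokens=2000, output_tokens=500))  # 0.0135
```

Because input and output are priced separately, trimming few-shot examples or capping the maximum output length both reduce cost directly.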
Open-source APIs provide access to open-weight models such as Llama and Mistral.
Sources: README.md 315-316
Local execution provides privacy, control, and zero marginal cost after initial hardware investment. Multiple tools support different use cases and technical expertise levels.
Sources: README.md 317, README.md 39
- LM Studio: desktop GUI for downloading and chatting with local models
- Ollama: lightweight CLI tool; uses `ollama run <model>` syntax for execution
- llama.cpp: C/C++ inference engine that runs GGUF models on CPU and GPU
- text-generation-webui (oobabooga): feature-rich Gradio web interface
- kobold.cpp: llama.cpp-based engine with a UI oriented toward storytelling
Sources: README.md 317, README.md 415
The ZeroSpace notebook automates creation of Gradio-based chat interfaces using Hugging Face's ZeroGPU infrastructure, which provides free GPU access for demos.
Sources: README.md 39
Prompt engineering involves crafting inputs to LLMs to elicit desired outputs. Techniques range from simple templates to complex multi-step reasoning patterns.
Sources: README.md 318
Zero-shot prompting gives the model a direct instruction without examples:
```
Translate the following English text to French:
"The quick brown fox jumps over the lazy dog."
```
Effective for simple tasks with well-understood instructions. Works better with larger models (70B+ parameters).
Few-shot prompting provides examples to demonstrate the desired behavior:
```
Translate the following English text to French:

English: Hello, how are you?
French: Bonjour, comment allez-vous?

English: I love programming.
French: J'aime programmer.

English: The quick brown fox jumps over the lazy dog.
French:
```
Significantly improves performance on specialized or ambiguous tasks, though the examples consume context-window tokens.
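Few-shot prompts like the one above can be assembled programmatically. A minimal sketch (the `build_few_shot_prompt` helper and its labels are illustrative, not from the course):

```python
# Sketch: assemble a few-shot prompt from (input, output) example pairs.
def build_few_shot_prompt(instruction, examples, query,
                          input_label="English", output_label="French"):
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"{input_label}: {source}")
        lines.append(f"{output_label}: {target}")
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")  # the model completes from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate the following English text to French:",
    [("Hello, how are you?", "Bonjour, comment allez-vous?"),
     ("I love programming.", "J'aime programmer.")],
    "The quick brown fox jumps over the lazy dog.",
)
print(prompt)
```

Ending the prompt with a bare output label is the key trick: the model's most natural continuation is the answer itself.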
Chain-of-thought (CoT) prompting instructs the model to show its reasoning steps:
```
Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?

Let's think step by step:
1. Roger starts with 5 tennis balls
2. He buys 2 cans of tennis balls
3. Each can has 3 tennis balls, so 2 cans have 2 × 3 = 6 tennis balls
4. Total tennis balls = 5 + 6 = 11

Answer: 11 tennis balls
```
Improves accuracy on mathematical and logical reasoning tasks. Appending "Let's think step by step" to a prompt is often enough to trigger CoT behavior.
ReAct (Reason + Act) prompting combines reasoning with actions:
```
Question: What is the population of the capital of France?

Thought: I need to find the capital of France first.
Action: search["capital of France"]
Observation: The capital of France is Paris.

Thought: Now I need to find the population of Paris.
Action: search["population of Paris"]
Observation: Paris has a population of approximately 2.2 million.

Thought: I have found the answer.
Answer: Approximately 2.2 million people.
```
Enables LLMs to interact with external tools and APIs. See Agents for advanced implementations.
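A minimal version of this loop can be sketched in Python. The `search` tool, its lookup table, and the scripted model below are stand-ins for a real LLM and a real search API:

```python
import re

# Sketch of a ReAct loop; the "search" tool is a hypothetical stand-in
# backed by a small lookup table.
TOOLS = {
    "search": lambda q: {
        "capital of France": "The capital of France is Paris.",
        "population of Paris": "Paris has a population of approximately 2.2 million.",
    }.get(q, "No result."),
}

def react_loop(call_model, question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_model(transcript)  # model emits Thought/Action or a final Answer
        transcript += step + "\n"
        answer = re.search(r"Answer: (.*)", step)
        if answer:
            return answer.group(1)
        action = re.search(r'Action: (\w+)\["(.*)"\]', step)
        if action:
            tool, arg = action.groups()
            # Feed the tool result back so the model can reason over it.
            transcript += f"Observation: {TOOLS[tool](arg)}\n"
    return None

# Scripted stand-in for a model, for demonstration only.
steps = iter([
    'Thought: I need the capital of France first.\nAction: search["capital of France"]',
    'Thought: Now I need the population of Paris.\nAction: search["population of Paris"]',
    "Answer: Approximately 2.2 million people.",
])
print(react_loop(lambda _: next(steps), "What is the population of the capital of France?"))
```

Real implementations add tool registries, error handling, and a token budget, but the loop structure (generate, parse an action, run the tool, append the observation) stays the same.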
The course includes a dedicated notebook on decoding strategies, which control how LLMs generate text.
Sources: README.md 71, README.md 179
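As an illustration of one such strategy, here is a minimal nucleus (top-p) sampler over a toy next-token distribution. The vocabulary and probabilities are made up for the example:

```python
import random

# Sketch of nucleus (top-p) sampling: sample only from the smallest set of
# tokens whose cumulative probability mass reaches p.
def top_p_sample(probs, p=0.9, rng=random):
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        total += prob
        if total >= p:
            break  # everything outside the nucleus is excluded
    tokens, weights = zip(*nucleus)
    return rng.choices(tokens, weights=weights)[0]

probs = {"the": 0.5, "a": 0.3, "dog": 0.15, "zebra": 0.05}
print(top_p_sample(probs, p=0.8))  # "dog" and "zebra" can never be chosen here
```

Unlike top-k, the nucleus size adapts to the distribution: a confident model keeps few candidates, an uncertain one keeps many.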
Many applications require LLMs to generate outputs conforming to specific formats (JSON, XML, templates). Structured output techniques ensure valid formatting and enable reliable parsing.
Sources: README.md 319, README.md 323-324
Many APIs (OpenAI, Anthropic) support a JSON mode natively.
The model then generates syntactically valid JSON, but the desired schema must still be described in the prompt.
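A common complementary pattern is to validate the reply and retry on failure. This sketch assumes a `call_model` function standing in for any chat-completion call:

```python
import json

# Sketch: request JSON, validate with json.loads, and retry on failure.
# `call_model` is a hypothetical stand-in for any chat-completion call.
def generate_json(call_model, prompt, retries=3):
    for _ in range(retries):
        raw = call_model(prompt + "\nRespond with valid JSON only.")
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # real code might feed the parse error back to the model
    raise ValueError("model never produced valid JSON")

# Demonstration with a scripted model that fails once, then succeeds.
replies = iter(["not json", '{"name": "Ada", "age": 36}'])
print(generate_json(lambda _: next(replies), "Extract the person."))
```

Validate-and-retry works with any provider but wastes tokens on failed attempts, which is exactly the inefficiency that constrained-generation libraries avoid.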
Outlines provides guided generation that ensures outputs match specified formats.
Outlines works by constraining the token sampling process, making invalid tokens impossible to select.
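The idea can be illustrated without the library: at each step, filter the vocabulary down to tokens that keep the output a valid prefix of an allowed string. This toy sketch (not Outlines' actual API; the vocabulary and stand-in "model" are contrived) shows the principle:

```python
# Toy illustration of guided generation: only tokens that keep the output
# a valid prefix of some allowed string may be selected at each step.
def constrained_generate(rank_tokens, vocab, allowed, max_len=20):
    out = ""
    while len(out) < max_len:
        valid = [t for t in vocab
                 if any(a.startswith(out + t) for a in allowed)]
        if not valid:
            break
        out += rank_tokens(out, valid)[0]  # take the model's top valid token
        if out in allowed:
            return out
    return out

vocab = ["true", "false", "yes", "tr", "ue"]
# Stand-in "model" that always prefers the longest candidate token.
pick_longest = lambda ctx, valid: sorted(valid, key=len, reverse=True)
print(constrained_generate(pick_longest, vocab, allowed={"true", "false"}))  # false
```

Real libraries compile the target format (a regex, JSON schema, or grammar) into a token-level mask applied to the logits, so invalid continuations get zero probability rather than being filtered after the fact.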
LMQL provides a programming language for LLM interactions with built-in constraints:
```
# Example structure (not actual code from repo)
"Extract information about {subject}: [CONTEXT]"
"Name: [NAME]" where len(TOKENS(NAME)) < 10
"Age: [AGE]" where INT(AGE)
```
Combines natural language prompts with programmatic constraints and control flow.
Sources: README.md 323-324
The course includes practical notebooks demonstrating LLM execution concepts:
| Notebook | Purpose | Related Concept |
|---|---|---|
| ZeroSpace | Create Gradio chat interface with ZeroGPU | Local/demo execution |
| Decoding Strategies | Text generation from beam search to nucleus sampling | Sampling methods |
| Improve ChatGPT with Knowledge Graphs | Augment responses with structured knowledge | Structured outputs |
The ZeroSpace notebook automates the deployment of chat interfaces.
Sources: README.md 39
The Decoding Strategies notebook provides implementations and visualizations of sampling methods, from greedy and beam search to nucleus sampling.
Sources: README.md 71
| Model Size | RAM (CPU) | VRAM (GPU) | Quantization Format |
|---|---|---|---|
| 7B params | 16GB | 8GB | GGUF Q4 |
| 13B params | 32GB | 16GB | GGUF Q4 |
| 34B params | 64GB | 24GB | GGUF Q4 |
| 70B params | 128GB | 48GB | GGUF Q4 |
Quantization (see Quantization) reduces requirements significantly. A 70B model quantized to 4-bit can run in ~40GB VRAM.
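These figures follow a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter, with KV cache and activation overhead added on top. A sketch (decimal GB, overhead excluded):

```python
# Rough sketch: memory needed just for model weights at a given precision.
# Real usage adds KV cache and activation overhead on top of this.
def weight_memory_gb(n_params_billions: float, bits_per_param: int) -> float:
    bytes_total = n_params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

print(weight_memory_gb(70, 16))  # fp16: 140.0 GB
print(weight_memory_gb(70, 4))   # 4-bit: 35.0 GB before overhead
```

The jump from 35 GB of weights to roughly 40 GB in practice is the cache and activation overhead, which grows with context length and batch size.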
LLMs have fixed context windows (token limits).
Context includes prompt, few-shot examples, chat history, and generated output. Monitor token usage to avoid truncation.
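A rough budget check can be sketched as follows; the tokens-per-word factor is a crude approximation, and real applications should count with the model's actual tokenizer:

```python
# Sketch: track a prompt against a fixed context window using a crude
# tokens ≈ words × 1.3 heuristic (use the model's real tokenizer in practice).
def approx_tokens(text: str) -> int:
    return int(len(text.split()) * 1.3)

def fits_context(system, examples, history, query, window=4096, reserve=512):
    """Check the prompt fits, leaving `reserve` tokens free for the output."""
    used = sum(approx_tokens(part) for part in [system, *examples, *history, query])
    return used + reserve <= window

print(fits_context("You are helpful.", [], [], "Summarize this.", window=4096))  # True
```

When the check fails, typical remedies are dropping the oldest chat turns, trimming few-shot examples, or summarizing the history before it is re-sent.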
Sources: README.md 311-324
This page covers basic LLM execution. For more advanced topics, see Deployment, Retrieval Augmented Generation, Agents, and Quantization.
Sources: README.md 305-441