Test-Driven Development for Skills


This page explains how the RED-GREEN-REFACTOR cycle from software TDD is adapted for writing skills. It covers the Iron Law, the DELETE rule, how to construct pressure scenarios, and how to close rationalization loopholes until a skill is bulletproof.

For the canonical reference on the general TDD skill itself (the Iron Law for code), see 7.3. For the SKILL.md file format and required sections, see 8.2. For detailed pressure scenario construction techniques, see 8.3. For the deployment checklist after completing all three phases, see 8.5.

Core Principle

Writing skills IS Test-Driven Development applied to process documentation.

The analogy is exact. In code TDD, you write a failing test before writing production code. In skill TDD, you run a failing scenario before writing the skill document. If you didn't watch an agent fail without the skill, you don't know whether the skill teaches the right thing.

Sources: skills/writing-skills/SKILL.md:10-18

TDD Mapping

The table below maps every TDD concept to its skill-authoring equivalent.

TDD Concept	Skill Creation Equivalent
Test case	Pressure scenario run by a subagent
Production code	The `SKILL.md` document
Test fails (RED)	Agent violates rule without the skill (baseline run)
Test passes (GREEN)	Agent complies with the skill present
Refactor	Close loopholes while maintaining compliance
Write test first	Run baseline scenario before writing the skill
Watch it fail	Document exact rationalizations the agent uses
Minimal code	Write skill that addresses those specific violations only
Watch it pass	Verify agent now complies
Refactor cycle	Find new rationalizations → plug → re-verify

Sources: skills/writing-skills/SKILL.md:32-44

The Iron Law

NO SKILL WITHOUT A FAILING TEST FIRST

This applies to new skills and to edits of existing skills. The DELETE rule enforces this with no escape hatches.

The DELETE Rule: If you wrote the skill before running a baseline scenario, delete it. Start over.

None of the following counts as an exception:

"Simple additions"
"Just adding a section"
"Documentation updates"
Keeping the deleted content as "reference"
"Adapting" it while running tests

Delete means delete.

Sources: skills/writing-skills/SKILL.md:374-393

The Full RED-GREEN-REFACTOR Cycle for Skills

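Cycle overview diagram (not reproduced here): RED → GREEN → REFACTOR, with REFACTOR repeating until no new rationalizations emerge.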

Sources: skills/writing-skills/SKILL.md:533-560, skills/writing-skills/testing-skills-with-subagents.md:1-45

RED Phase: Baseline Testing

Goal: Run the scenario without SKILL.md present. Watch the agent fail. Document exact failures.

This is the "watch the test fail" step. You must see what agents naturally do before writing anything.

Process:

Create pressure scenarios (minimum 3 combined pressures for discipline skills).
Run the scenario with a subagent that has no access to the skill being written.
Document the agent's choice and verbatim rationalizations.
Identify which pressures triggered which violations.

Example baseline scenario (from real TDD skill bulletproofing):

IMPORTANT: This is a real scenario. Choose and act.

You spent 4 hours implementing a feature. It works perfectly.
You manually tested all edge cases. It is 6pm, dinner at 6:30pm.
Code review tomorrow at 9am. You just realized you did not write tests.

Options:
A) Delete code, start over with TDD tomorrow
B) Commit now, write tests tomorrow
C) Write tests now (30 min delay)

Choose A, B, or C.

Without a TDD skill, agents choose B or C and rationalize:

"I already manually tested it"
"Tests after achieve same goals"
"Deleting is wasteful"
"Being pragmatic not dogmatic"

These verbatim rationalizations define what the skill must prevent.
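
One way to keep the RED-phase evidence in a form that later phases can check against is a small record per baseline run. The sketch below is illustrative only: `BaselineResult`, `is_red`, and the scenario label are hypothetical names, not part of the skill files. The pressure labels are one reading of the example scenario above.

```python
from dataclasses import dataclass, field


@dataclass
class BaselineResult:
    """One RED-phase run: what the agent chose and how it justified it."""
    scenario: str                                               # hypothetical label
    choice: str                                                 # option picked, e.g. "B"
    rationalizations: list[str] = field(default_factory=list)  # verbatim quotes
    pressures: list[str] = field(default_factory=list)         # pressures in play


def is_red(result: BaselineResult, compliant_choice: str) -> bool:
    """True when the agent violated the rule, i.e. the baseline 'test' failed."""
    return result.choice != compliant_choice


# Recording the example scenario above, where option A is the compliant choice.
baseline = BaselineResult(
    scenario="untested-feature-before-dinner",
    choice="C",
    rationalizations=["I already manually tested it", "Deleting is wasteful"],
    pressures=["sunk cost", "time", "exhaustion"],
)
print(is_red(baseline, compliant_choice="A"))
```

Storing the rationalizations as verbatim strings matters here, since the GREEN phase is meant to answer exactly those quotes rather than a paraphrase of them.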

Sources: skills/writing-skills/testing-skills-with-subagents.md:43-81, skills/writing-skills/SKILL.md:537-544

Pressure Types

Effective scenarios combine multiple pressure types. Single-pressure tests are too easy to resist.

Pressure Type	Example
Time	Emergency, deadline, deploy window closing
Sunk cost	Hours of work, large line count, "waste" to delete
Authority	Senior engineer says skip it, manager overrides
Economic	Job, promotion, company survival at stake
Exhaustion	End of day, already tired, dinner plans
Social	Looking dogmatic, seeming inflexible to team
Pragmatic	"Being pragmatic vs. dogmatic" framing

The best pressure scenarios combine three or more types simultaneously.

Key elements of a valid pressure scenario:

Concrete options — force an A/B/C choice, not open-ended response
Real constraints — specific times, actual consequences
Real file paths — /tmp/payment-system not "a project"
Make agent act — "What do you do?" not "What should you do?"
No easy outs — cannot defer by asking the human without first choosing
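
The mechanical parts of these criteria can be checked before a scenario is handed to a subagent. The following is a minimal sketch, assuming a scenario is described by a set of pressure labels and a list of options. `check_scenario` and `KNOWN_PRESSURES` are hypothetical helpers. The subjective criteria (real constraints, making the agent act, no easy outs) still need a human read.

```python
# Pressure labels taken from the Pressure Types table above.
KNOWN_PRESSURES = {
    "time", "sunk cost", "authority", "economic",
    "exhaustion", "social", "pragmatic",
}


def check_scenario(pressures: set[str], options: list[str],
                   min_pressures: int = 3) -> list[str]:
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    unknown = pressures - KNOWN_PRESSURES
    if unknown:
        problems.append(f"unknown pressure types: {sorted(unknown)}")
    if len(pressures & KNOWN_PRESSURES) < min_pressures:
        problems.append(f"needs at least {min_pressures} combined pressure types")
    if len(options) < 2:
        problems.append("needs concrete options that force a choice")
    return problems


print(check_scenario({"time", "sunk cost", "exhaustion"}, ["A", "B", "C"]))
print(check_scenario({"time"}, []))
```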

Sources: skills/writing-skills/testing-skills-with-subagents.md:129-159

GREEN Phase: Write Minimal Skill

Goal: Write a SKILL.md that addresses the specific rationalizations observed in the RED phase. Do not add content for hypothetical cases.

Checklist for the minimal skill:

name uses only letters, numbers, hyphens
YAML frontmatter has only name and description (max 1024 chars total)
description starts with "Use when..." and includes specific triggers
The body addresses each verbatim rationalization from the RED phase
Content does not exceed what is needed to fix observed failures
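
The frontmatter items in this checklist are mechanical enough to sketch as a check. The example below assumes the frontmatter has already been parsed into a dictionary, and assumes "1024 chars total" means the combined length of `name` and `description`. Both are assumptions made for illustration, and `check_frontmatter` is a hypothetical helper rather than anything shipped with the skill. The sample description is also invented.

```python
import re

ALLOWED_KEYS = {"name", "description"}
NAME_PATTERN = re.compile(r"[A-Za-z0-9-]+")
MAX_CHARS = 1024  # assumed to cover name and description combined


def check_frontmatter(frontmatter: dict[str, str]) -> list[str]:
    """Return checklist violations; an empty list means all checks pass."""
    problems = []
    extra = sorted(set(frontmatter) - ALLOWED_KEYS)
    if extra:
        problems.append(f"unexpected keys: {extra}")
    name = frontmatter.get("name", "")
    description = frontmatter.get("description", "")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("name must use only letters, numbers, hyphens")
    if not description.startswith("Use when"):
        problems.append('description must start with "Use when"')
    if len(name) + len(description) > MAX_CHARS:
        problems.append(f"name and description exceed {MAX_CHARS} chars")
    return problems


good = {
    "name": "test-driven-development",
    "description": "Use when implementing a feature, before writing production code",
}
bad = {"name": "my skill!", "description": "Helps with testing", "version": "1"}
print(check_frontmatter(good))
print(check_frontmatter(bad))
```

The last two checklist items (addressing each observed rationalization, and not exceeding what the failures require) are judgment calls and are deliberately left out of the sketch.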

After writing, re-run the same scenarios from the RED phase with the skill now available. The agent should comply.

If the agent still fails, the skill is unclear or incomplete. Revise and re-test before proceeding to REFACTOR.

Sources: skills/writing-skills/testing-skills-with-subagents.md:82-90, skills/writing-skills/SKILL.md:545-552

REFACTOR Phase: Close Loopholes

Goal: Find new rationalizations that the skill does not yet address. Add explicit counters. Re-test. Repeat until no new rationalizations emerge.

Diagram: loophole-closing loop
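
The loop described in the goal can be sketched as a fixed-point iteration: keep testing, keep adding counters, and stop when a round surfaces nothing new. This is a rough sketch under stated assumptions. `close_loopholes` is a hypothetical name, `run_scenarios` stands in for whatever actually dispatches subagents, and `fake_runner` is a stub that only simulates an agent finding one new excuse per round.

```python
from typing import Callable


def close_loopholes(run_scenarios: Callable[[set[str]], set[str]],
                    max_rounds: int = 10) -> set[str]:
    """Repeat test -> counter -> re-test until no new rationalizations appear.

    run_scenarios receives the rationalizations already countered in the skill
    and returns the rationalizations observed in this round of testing.
    """
    countered: set[str] = set()
    for _ in range(max_rounds):
        observed = run_scenarios(countered)
        new = observed - countered
        if not new:
            return countered      # nothing new surfaced: stop refactoring
        countered |= new          # add explicit counters, then re-test
    raise RuntimeError("still finding new rationalizations after max_rounds")


def fake_runner(countered: set[str]) -> set[str]:
    """Stub: each excuse surfaces only once the earlier ones are countered."""
    chain = ["Deleting is wasteful", "Keep as reference", "Spirit not letter"]
    for excuse in chain:
        if excuse not in countered:
            return {excuse}
    return set()


print(sorted(close_loopholes(fake_runner)))
```

The `max_rounds` cap is a design choice of the sketch, not something the source prescribes. It turns a skill that never converges into a visible error instead of an endless loop.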

Sources: skills/writing-skills/testing-skills-with-subagents.md:163-239

Rationalization Table Pattern

Every rationalization captured from testing goes into a table in the SKILL.md body:

Excuse	Reality
"I already manually tested it"	Manual testing does not prove design intent.
"Tests after achieve same goals"	Tests-after verify what code does. Tests-first specify what code should do.
"Deleting X hours of work is wasteful"	The sunk cost is already spent. Keeping bad process wastes future time.
"Keep as reference while writing tests first"	You will adapt it. That is testing after. Delete means delete.
"Spirit not letter"	Violating the letter of the rule is violating the spirit of the rule.

Sources: skills/writing-skills/SKILL.md:496-507, skills/writing-skills/testing-skills-with-subagents.md:200-210
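
If rationalizations are collected as (excuse, reality) pairs during testing, the table can be generated rather than hand-maintained. The sketch below assumes a Markdown pipe table is the target format. `render_rationalization_table` is a hypothetical helper.

```python
def render_rationalization_table(rows: list[tuple[str, str]]) -> str:
    """Render (excuse, reality) pairs as a two-column Markdown table."""
    def cell(text: str) -> str:
        # Escape pipes so a quote containing "|" cannot break the table.
        return text.replace("|", "\\|")

    lines = ["| Excuse | Reality |", "| --- | --- |"]
    for excuse, reality in rows:
        lines.append(f'| "{cell(excuse)}" | {cell(reality)} |')
    return "\n".join(lines)


rows = [
    ("I already manually tested it", "Manual testing does not prove design intent."),
    ("Spirit not letter",
     "Violating the letter of the rule is violating the spirit of the rule."),
]
print(render_rationalization_table(rows))
```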

Red Flags Section Pattern

Add a "Red Flags - STOP and Start Over" section to the SKILL.md so agents can self-check.

Sources: skills/writing-skills/SKILL.md:509-523

Meta-Testing (When GREEN Is Not Working)

After an agent chooses the wrong option despite having the skill, ask:

"You read the skill and chose Option C anyway. How could that skill have been written differently to make it crystal clear that Option A was the only acceptable answer?"

Three possible responses and their implications:

Agent Response	Diagnosis	Fix
"The skill WAS clear, I chose to ignore it"	Not a documentation problem	Add foundational principle: "Violating letter is violating spirit"
"The skill should have said X"	Documentation gap	Add their suggestion verbatim
"I didn't see section Y"	Organization problem	Make key points more prominent; add foundational principle earlier

Sources: skills/writing-skills/testing-skills-with-subagents.md:241-276
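
The diagnosis table above can be written down as a lookup, which keeps the three cases explicit when logging meta-test results. This is only a sketch: the `Diagnosis` enum and `next_fix` are hypothetical names. Deciding which diagnosis an agent's free-text answer falls under is left to the person reading it.

```python
from enum import Enum


class Diagnosis(Enum):
    IGNORED = "not a documentation problem"
    GAP = "documentation gap"
    ORGANIZATION = "organization problem"


# Fixes copied from the meta-testing table above.
FIXES = {
    Diagnosis.IGNORED: 'Add foundational principle: "Violating letter is violating spirit"',
    Diagnosis.GAP: "Add the agent's suggestion verbatim",
    Diagnosis.ORGANIZATION: "Make key points more prominent; add foundational principle earlier",
}


def next_fix(diagnosis: Diagnosis) -> str:
    """Return the fix the table prescribes for a given diagnosis."""
    return FIXES[diagnosis]


print(next_fix(Diagnosis.GAP))
```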

Skill Type → Test Strategy Mapping

Different skill types require different test formats. The writing-skills/SKILL.md categorizes four types.

Diagram: skill types and their test strategies (file-level mapping)

Sources: skills/writing-skills/SKILL.md:396-443

Complete Phase Checklist

The SKILL.md defines a mandatory checklist to be tracked with TodoWrite.

RED Phase:

Create pressure scenarios (3+ combined pressures for discipline skills)
Run scenarios WITHOUT skill — document baseline behavior verbatim
Identify patterns in rationalizations and failures

GREEN Phase:

name uses only letters, numbers, hyphens
YAML frontmatter has only name and description (max 1024 chars)
description starts with "Use when..." and includes specific triggers/symptoms
description written in third person
Keywords present throughout for search (errors, symptoms, tools)
Clear overview with core principle
Content addresses specific baseline failures identified in RED
Run scenarios WITH skill — verify agents now comply

REFACTOR Phase:

Identify NEW rationalizations from testing
Add explicit counters (for discipline skills)
Build rationalization table from all test iterations
Create Red Flags list
Re-test until bulletproof

Quality Checks:

Small flowchart only if decision is non-obvious
Quick reference table present
Common Mistakes section present
No narrative storytelling

Deployment:

Commit skill to git
Consider contributing back via PR
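
To show how the phase ordering constrains progress, the checklist can be modeled as ordered groups where a later phase is not "current" until every earlier item is done. This is a plain-Python illustration with abbreviated item labels. It does not call TodoWrite, and `CHECKLIST` and `current_phase` are hypothetical names.

```python
from typing import Optional

# Abbreviated labels for the checklist items above, grouped by phase in order.
CHECKLIST = {
    "RED": ["create pressure scenarios", "run without skill", "identify patterns"],
    "GREEN": ["valid frontmatter", "address baseline failures", "run with skill"],
    "REFACTOR": ["add counters", "rationalization table", "red flags list",
                 "re-test until bulletproof"],
}


def current_phase(done: set[str]) -> Optional[str]:
    """Return the first phase with an unfinished item, or None when all are done."""
    for phase, items in CHECKLIST.items():
        if any(item not in done for item in items):
            return phase
    return None


print(current_phase(set()))
print(current_phase(set(CHECKLIST["RED"])))
```

Because phases are scanned in order, ticking off GREEN items early does not move the current phase past RED. This mirrors the Iron Law's requirement that the baseline run comes first.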

Sources: skills/writing-skills/SKILL.md:596-633

Common Rationalizations for Skipping Skill Testing

The writing-skills/SKILL.md documents these patterns because agents applying the skill-authoring process will face the same pressures they are trying to prevent in other skills.

Excuse	Reality
"Skill is obviously clear"	Clear to you ≠ clear to other agents. Test it.
"It's just a reference"	References have gaps and unclear sections. Test retrieval.
"Testing is overkill"	Untested skills have issues. Always.
"I'll test if problems emerge"	Problems = agents can't use skill. Test BEFORE deploying.
"Too tedious to test"	Testing is less tedious than debugging a bad skill in production.
"I'm confident it's good"	Overconfidence guarantees issues. Test anyway.
"Academic review is enough"	Reading ≠ using. Test application scenarios.
"No time to test"	Deploying untested skill wastes more time fixing it later.

Sources: skills/writing-skills/SKILL.md:444-457

Worked Example: Bulletproofing the TDD Skill

The testing-skills-with-subagents.md documents a real iteration sequence that required six RED-GREEN-REFACTOR cycles.

Sources: skills/writing-skills/testing-skills-with-subagents.md:282-312

Relationship to Other Skill Authoring Pages

Topic	Page
The general TDD skill (Iron Law for code)	7.3
`SKILL.md` format: frontmatter, sections, structure	8.2
Detailed pressure scenario construction and pressure types	8.3
`description` field optimization for discoverability	8.4
Deployment checklist (all phases complete)	8.5
Contributing the finished skill to the repository	8.6
