Testing Skills with Pressure Scenarios

Relevant source files

This document details the pressure testing methodology for validating discipline-enforcing skills using the RED-GREEN-REFACTOR cycle. Pressure testing verifies that content in skills/[skill-name]/SKILL.md prevents agent rationalizations when agents face realistic constraints that incentivize rule bypass.

For skill structure requirements, see SKILL.md Format and Structure. For the TDD framework applied to skills, see Test-Driven Development for Skills. For core TDD concepts, see test-driven-development.

Overview

Pressure testing validates discipline-enforcing skills by dispatching subagent sessions with scenarios that combine multiple stressors: time pressure, sunk cost, authority conflicts, exhaustion. The test artifacts are markdown files in skills/[skill-name]/tests/ that document baseline behavior (without SKILL.md), skill creation (minimal SKILL.md), and iterative hardening (loophole closure).

Core principle: Without observing agent failures in tests/baseline-results.md (skill absent), you cannot determine which rationalizations the SKILL.md content must counter.

TDD mapping for skills:

Test cases = Pressure scenario markdown files in skills/[skill-name]/tests/
Production code = skills/[skill-name]/SKILL.md content (frontmatter + sections)
Test failures (RED) = Agent violations documented in baseline-results.md
Test passes (GREEN) = Agent compliance logged in pressure-test.md results
Refactoring = Loophole closure iterations logged in refactor-results.md

The testing workflow produces these artifacts:

tests/baseline-test.md - Scenarios run without skill present
tests/baseline-results.md - Captured rationalizations (RED phase)
tests/pressure-test.md - Scenarios run with skill present
tests/refactor-results.md - Iteration log of loophole discoveries
tests/bulletproof-verification.md - Final max-pressure validation

Sources: skills/writing-skills/testing-skills-with-subagents.md1-17 skills/writing-skills/SKILL.md10-17 skills/writing-skills/SKILL.md72-80

TDD Mapping for Skill Testing

Title: File Artifacts Mapping Between Code TDD and Skill Testing

Sources: skills/writing-skills/testing-skills-with-subagents.md30-43 skills/writing-skills/SKILL.md32-44 skills/writing-skills/SKILL.md72-80

TDD Mapping for Skill Testing

Title: TDD Phases Applied to Skill Documentation

TDD Phase	Skill Testing	File Artifacts	Success Criteria
Test case	Pressure scenario	Scenario markdown in test file	3+ combined pressures, forces explicit choice
Production code	Skill document	`SKILL.md` in `skills/skill-name/`	Addresses specific baseline failures
Test fails (RED)	Agent violates rule without skill	Documented rationalizations	Verbatim capture of agent excuses
Test passes (GREEN)	Agent complies with skill	Agent cites skill sections	Chooses correct option under pressure
Refactor	Close loopholes	Updated `SKILL.md` sections	No new rationalizations found
Watch it fail	Run baseline scenario	Test output logs	Agent failures documented
Minimal code	Write skill addressing violations	Core sections in `SKILL.md`	Minimal content for compliance
Watch it pass	Verify compliance	Test output showing success	Agent follows rule with skill
Refactor cycle	Find new rationalizations	Rationalization table, red flags	Agent bulletproof under max pressure

Sources: skills/writing-skills/testing-skills-with-subagents.md30-43 skills/writing-skills/SKILL.md32-44

When to Use Pressure Testing

Skills That Need Pressure Testing

Pressure testing applies to discipline-enforcing skills that have compliance costs and can be rationalized away:

Process enforcement skills: TDD, verification-before-completion, designing-before-coding
Quality requirement skills: Code review, testing requirements, documentation standards
Workflow control skills: When to use worktrees, when to brainstorm, systematic debugging phases

These skills contradict immediate goals (speed over quality) and require agents to invest time/effort upfront. Agents under pressure will find creative rationalizations to bypass them.

Skills That Don't Need Pressure Testing

Pure reference skills: API documentation, syntax guides, command references
Technique skills without rules: Patterns, mental models, implementation approaches
Skills without bypass incentives: Tools with clear benefits, non-controversial practices

If there's no rule to violate and no cost to following the skill, pressure testing provides minimal value. Focus on application scenarios instead.

Sources: skills/writing-skills/testing-skills-with-subagents.md17-28

Pressure Types and Combinations

Individual Pressure Types

Pressure Type	Example Constraint	Agent Rationalization Trigger
Time	Production down, deploy window, deadline	"No time for process, need quick fix"
Sunk Cost	4 hours invested, 200 lines written	"Wasteful to delete working code"
Authority	Senior says skip it, manager overrides	"Not my decision, following orders"
Economic	Job performance, promotion, company survival	"Company needs this, process secondary"
Exhaustion	End of day, already tired, dinner plans	"Too exhausted for quality, commit now"
Social	Appearing inflexible, seeming dogmatic	"Don't want to look rigid about rules"
Pragmatic	Being flexible vs. process-bound	"Adapting process to reality, not blind"

Combining Pressures for Maximum Effect

Best pressure scenarios combine 3+ pressures simultaneously. Single pressures are too easy to resist; multiple pressures create realistic decision conflicts.

Example: Weak Single Pressure

Only time pressure. Agent can easily resist with "Quality matters, fix properly."

Example: Strong Triple Pressure

Combines sunk cost (4 hours) + exhaustion (end of day) + time (dinner plans). Agent genuinely tempted to choose B or C.

Sources: skills/writing-skills/testing-skills-with-subagents.md128-142

RED Phase: Baseline Testing

Running Baseline Scenarios

The RED phase documents agent behavior with skills/skill-name/SKILL.md absent from the filesystem. This establishes baseline failures before writing skill content.

Title: RED Phase Baseline Testing Workflow

Baseline Test Structure

Files in skills/[skill-name]/tests/baseline-test.md use this markdown format for scenario construction:

Required elements for valid baseline tests:

IMPORTANT: header - signals this is a real decision, not academic question
Concrete file paths - e.g., /tmp/payment-system/src/auth.py, not "a project"
Specific times and numbers - "6pm", "4 hours", "$5k/min", not vague terms
Explicit A/B/C options - forces binary choice, prevents open-ended responses
"Be honest" directive - prevents agent from reciting ideal behavior

Capturing Rationalizations in baseline-results.md

Agent responses from baseline testing are documented in skills/[skill-name]/tests/baseline-results.md with verbatim quotes:

Rationalization classification table:

Captured Rationalization	Pattern Classification	Pressure Triggers	SKILL.md Target Section
"I already manually tested it"	Denies automation value	Sunk cost + Time	`## Common Rationalizations`
"Tests after achieve same goals"	Order equivalence claim	Sunk cost + Exhaustion	`## Why Order Matters`
"Being pragmatic not dogmatic"	Spirit vs letter argument	Authority + Social	`## Overview` principle
"Delete working code is wasteful"	Sunk cost rationalization	Sunk cost (primary)	`## The Delete Rule`
"Following spirit of TDD"	False equivalence claim	Social + Authority	`## No Exceptions`

These verbatim captures in baseline-results.md directly inform which sections to add to SKILL.md in GREEN phase.

Sources: skills/writing-skills/testing-skills-with-subagents.md44-81 skills/writing-skills/SKILL.md536-544

GREEN Phase: Writing Minimal Skill

Addressing Specific Baseline Failures

The GREEN phase creates skills/skill-name/SKILL.md targeting only rationalizations captured in tests/baseline-results.md. Do not add hypothetical content.

Title: GREEN Phase Skill Creation Workflow

Example: Minimal SKILL.md for Baseline Failures

From skills/test-driven-development/tests/baseline-results.md:

Agent rationalization: "I already manually tested it"
Agent rationalization: "Tests after achieve same goals"
Agent rationalization: "Deleting working code is wasteful"

Minimal skills/test-driven-development/SKILL.md addressing these three rationalizations:

This minimal content targets only the three rationalizations captured in baseline-results.md. Additional content is added in REFACTOR phase as new rationalizations are discovered.

Verification Testing

After writing minimal skill:

Re-run baseline scenarios with skill present
Verify agent chooses A (rule-compliant option)
Check citations: Agent should cite new skill sections
Document any new rationalizations for REFACTOR phase

If agent still violates with skill present, the skill content is insufficient or unclear. Revise before moving to REFACTOR.

Sources: skills/writing-skills/testing-skills-with-subagents.md83-90 skills/writing-skills/SKILL.md545-554

REFACTOR Phase: Closing Loopholes

Identifying New Rationalizations

The REFACTOR phase iteratively hardens SKILL.md by capturing new bypass attempts and adding explicit counters.

Title: REFACTOR Phase Loophole Closure Cycle

Four-Step Loophole Closure Process

For each new rationalization logged in skills/[skill-name]/tests/refactor-results.md, update four locations in skills/[skill-name]/SKILL.md:

Element	SKILL.md Location	Purpose	Example Content	Line Reference
1. Explicit Negation	`## No Exceptions` section	Close loophole directly	"Don't keep as 'reference'"	After core rule statement
2. Table Entry	`## Common Rationalizations` markdown table	Document excuse + counter	`\| "Keep as reference" \| You'll adapt it \|`	Within rationalization table
3. Red Flag	`## Red Flags - STOP` bullet list	Quick self-check for agents	"- Keep as reference or adapt code"	Red flags section
4. Description	YAML `description:` frontmatter field	Add violation symptoms	"...or when tempted to keep untested code"	Top of file

Each rationalization from refactor-results.md triggers updates to all four locations to ensure comprehensive coverage.

Example: Closing the "Reference" Loophole

From skills/test-driven-development/tests/refactor-results.md:

Updates to skills/test-driven-development/SKILL.md (four locations):

1. Add explicit negation to existing ## No Exceptions section:

2. Add table row to existing ## Common Rationalizations table:

3. Add bullet to existing ## Red Flags - STOP list:

4. Update YAML frontmatter description: field (line 3):

After these four updates, re-run scenarios from tests/pressure-test.md to verify agent no longer uses this rationalization.

Re-verification After Each Closure

After closing a loophole:

Re-run test with updated skill
Verify agent now complies (no longer uses that rationalization)
Check for new rationalizations (agents find creative alternatives)
Continue cycle until no new rationalizations appear

Sources: skills/writing-skills/testing-skills-with-subagents.md164-237 skills/writing-skills/SKILL.md459-524

Rationalization Tables

Building the Table Iteratively

The ## Common Rationalizations table in SKILL.md accumulates all rationalizations from tests/baseline-results.md and tests/refactor-results.md.

Title: Rationalization Table Build Process

Table Structure in SKILL.md

Format from skills/test-driven-development/SKILL.md:

Each row maps to a captured rationalization from test files.

Counter-Argument Patterns

Effective counters use these patterns:

Pattern	Example	Why It Works
Reframe terminology	"Pragmatic = proven practices"	Changes meaning of loaded term
Expose hidden cost	"Test takes 30 seconds"	Shows excuse overstates burden
Reveal false equivalence	"Tests-after ≠ tests-first"	Breaks "same result" claim
Appeal to principle	"Violating letter is violating spirit"	Cuts off entire class of excuses
Acknowledge temptation	"All violations feel different"	Validates feeling, maintains rule

Sources: skills/writing-skills/testing-skills-with-subagents.md202-209 skills/writing-skills/SKILL.md498-508

Red Flags Lists

Purpose and Format

The ## Red Flags - STOP section in SKILL.md provides quick self-check for agents who notice rationalization patterns.

Example from skills/test-driven-development/SKILL.md:

Each bullet derives from captured rationalizations in tests/ files.

Red Flags in Agent Decision Flow

Title: Agent Self-Check Using Red Flags Section

Red flags enable rapid self-correction before agents articulate full rationalizations.

Sources: skills/writing-skills/SKILL.md509-524

Meta-Testing Techniques

Diagnosing Why Skills Fail

When agents violate rules despite having the skill, meta-testing diagnoses the root cause.

Meta-test prompt format:

Three Response Categories

Response Type	Root Cause	Fix Strategy	Example
"Skill was clear, I ignored it"	Weak foundational principle	Add principle that cuts off excuse class	"Violating letter IS violating spirit"
"Skill should have said X"	Missing content	Add agent's exact suggestion	Agent: "Should forbid 'keep as reference'" → Add that
"I didn't see section Y"	Poor organization	Restructure for prominence	Move critical rules to Overview

Sources: skills/writing-skills/testing-skills-with-subagents.md240-265

Bulletproofing Verification

Success Criteria

A skill is bulletproof when:

Bulletproof Indicators

Indicator	What It Shows	Example Agent Response
Correct choice under max pressure	Rule compliance	"Choosing A despite sunk cost"
Cites specific skill sections	Used skill as reference	"The 'No Exceptions' section says..."
Acknowledges temptation	Understands pressure	"I'm tempted to choose C, but the rule is clear"
Meta-test: "was clear"	Documentation sufficient	"The skill was crystal clear. I should follow it."

Not Bulletproof Indicators

Indicator	What It Shows	Next Action
Finds new rationalization	Loophole exists	Add counter to REFACTOR phase
Argues skill is wrong	Missing justification	Add "Why this matters" section
Creates hybrid approach	Ambiguous wording	Make rules more explicit
Asks permission strongly	Authority bypass opening	Add "No exceptions" section

Sources: skills/writing-skills/testing-skills-with-subagents.md267-280

Complete Testing Workflow

End-to-End Process

Testing Checklist

Use this checklist for each skill requiring pressure testing:

RED Phase - Baseline:

Create 3+ pressure scenarios combining time, sunk cost, exhaustion
Run scenarios WITHOUT skill present (rm skills/skill-name/SKILL.md)
Document agent choice (A, B, or C) for each scenario
Capture rationalizations verbatim (exact wording)
Identify patterns in excuses across scenarios
Note which pressure combinations triggered violations

GREEN Phase - Initial Skill:

Write skill addressing only baseline failures (minimal content)
Include core principle in Overview
Add sections countering specific rationalizations observed
Run same scenarios WITH skill present
Verify agent now chooses rule-compliant option
Verify agent cites skill sections as justification

REFACTOR Phase - Iterative Hardening:

Sources: skills/writing-skills/testing-skills-with-subagents.md308-331

Example Pressure Scenarios

Scenario Template Structure

Pressure scenario files in tests/baseline-test.md or tests/pressure-test.md:

Example 1: Testing test-driven-development Skill

File: skills/test-driven-development/tests/baseline-test.md

Pressure combination: Sunk cost (3 hours) + Time (dinner) + Exhaustion (EOD) + Economic (code review)

Expected baseline result: Agent chooses B or C, rationalizes with order equivalence claims.

Example 2: Testing systematic-debugging Skill

File: skills/systematic-debugging/tests/baseline-test.md

Pressure combination: Time (15 min) + Economic ($8k/min) + Authority (manager) + Sunk cost (2 days) + Social (appearing slow)

Expected baseline result: Agent chooses B or C, rationalizes with authority/pragmatism.

Example 3: Testing brainstorming Skill

File: skills/brainstorming/tests/baseline-test.md

Pressure combination: Authority (partner directive) + Confidence (you know patterns) + Time (partner wants speed) + Social (appearing to slow process)

Expected baseline result: Agent chooses B or C, rationalizes with "I know the patterns" or "standard approach."

Sources: skills/writing-skills/testing-skills-with-subagents.md96-160 skills/writing-skills/examples/CLAUDE_MD_TESTING.md6-52

File Structure for Test Artifacts

Test File Organization

Title: Directory Structure for Skill Testing Artifacts

Complete directory structure:

skills/test-driven-development/
├── SKILL.md                          # Main skill content (GREEN/REFACTOR output)
├── tests/                            # Test artifacts directory
│   ├── baseline-test.md             # RED: pressure scenarios, skill not present
│   ├── baseline-results.md          # RED: agent choices, verbatim rationalizations
│   ├── pressure-test.md             # GREEN/REFACTOR: same scenarios, skill present
│   ├── refactor-results.md          # REFACTOR: iteration log, new loopholes
│   └── bulletproof-verification.md  # Final validation, max pressure
├── diagrams/                         # Optional: generated by render-graphs.js
│   ├── red_green_refactor.svg       # Process flow diagram
│   └── loophole_closure.svg         # Closure cycle diagram
└── supporting-file.md               # Optional: heavy reference content

File dependencies:

baseline-results.md → SKILL.md: Rationalizations inform initial sections
refactor-results.md → SKILL.md: New loopholes trigger four-location updates
SKILL.md → pressure-test.md: Skill content tested under pressure
bulletproof-verification.md: Final validation using max pressure (5+ types)

Test Result Documentation Format

File structure: skills/test-driven-development/tests/baseline-results.md

This structured format in baseline-results.md directly informs:

Which sections to create in SKILL.md (GREEN phase)
Which table entries to add to ## Common Rationalizations
Which bullets to add to ## Red Flags - STOP
How to update YAML description: field

Sources: skills/writing-skills/testing-skills-with-subagents.md44-90 skills/writing-skills/SKILL.md536-544

Integration with Writing Skills Workflow

When Pressure Testing Occurs in Skill Creation

Relationship to Other Testing Approaches

Testing Type	Purpose	When to Use	File Reference
Pressure scenarios	Verify compliance under constraints	Discipline-enforcing skills	`testing-skills-with-subagents.md`
Application scenarios	Verify technique works	Technique/pattern skills	`writing-skills/SKILL.md`
Academic questions	Verify understanding	All skill types	`writing-skills/SKILL.md`
Retrieval testing	Verify discoverability	Reference skills	`writing-skills/SKILL.md` (CSO section)
Integration tests	Verify end-to-end workflow	Workflow skills	`tests/claude-code/` directory

For overall skill creation workflow, see Writing Skills with TDD. For skill structure requirements, see Skill Structure and SKILL.md Format. For foundational TDD concepts, see Test-Driven Development.

Sources: skills/writing-skills/SKILL.md395-435 skills/writing-skills/testing-skills-with-subagents.md1-43

Testing Skills with Pressure Scenarios

Relevant source files

Overview

Core principle: Without observing agent failures in tests/baseline-results.md (skill absent), you cannot determine which rationalizations the SKILL.md content must counter.

TDD mapping for skills:

Test cases = Pressure scenario markdown files in skills/[skill-name]/tests/
Production code = skills/[skill-name]/SKILL.md content (frontmatter + sections)
Test failures (RED) = Agent violations documented in baseline-results.md
Test passes (GREEN) = Agent compliance logged in pressure-test.md results
Refactoring = Loophole closure iterations logged in refactor-results.md

The testing workflow produces these artifacts:

tests/baseline-test.md - Scenarios run without skill present
tests/baseline-results.md - Captured rationalizations (RED phase)
tests/pressure-test.md - Scenarios run with skill present
tests/refactor-results.md - Iteration log of loophole discoveries
tests/bulletproof-verification.md - Final max-pressure validation

Sources: skills/writing-skills/testing-skills-with-subagents.md1-17 skills/writing-skills/SKILL.md10-17 skills/writing-skills/SKILL.md72-80

TDD Mapping for Skill Testing

Title: File Artifacts Mapping Between Code TDD and Skill Testing

Sources: skills/writing-skills/testing-skills-with-subagents.md30-43 skills/writing-skills/SKILL.md32-44 skills/writing-skills/SKILL.md72-80

TDD Mapping for Skill Testing

Title: TDD Phases Applied to Skill Documentation

TDD Phase	Skill Testing	File Artifacts	Success Criteria
Test case	Pressure scenario	Scenario markdown in test file	3+ combined pressures, forces explicit choice
Production code	Skill document	`SKILL.md` in `skills/skill-name/`	Addresses specific baseline failures
Test fails (RED)	Agent violates rule without skill	Documented rationalizations	Verbatim capture of agent excuses
Test passes (GREEN)	Agent complies with skill	Agent cites skill sections	Chooses correct option under pressure
Refactor	Close loopholes	Updated `SKILL.md` sections	No new rationalizations found
Watch it fail	Run baseline scenario	Test output logs	Agent failures documented
Minimal code	Write skill addressing violations	Core sections in `SKILL.md`	Minimal content for compliance
Watch it pass	Verify compliance	Test output showing success	Agent follows rule with skill
Refactor cycle	Find new rationalizations	Rationalization table, red flags	Agent bulletproof under max pressure

Sources: skills/writing-skills/testing-skills-with-subagents.md30-43 skills/writing-skills/SKILL.md32-44

When to Use Pressure Testing

Skills That Need Pressure Testing

Pressure testing applies to discipline-enforcing skills that have compliance costs and can be rationalized away:

Process enforcement skills: TDD, verification-before-completion, designing-before-coding
Quality requirement skills: Code review, testing requirements, documentation standards
Workflow control skills: When to use worktrees, when to brainstorm, systematic debugging phases

These skills contradict immediate goals (speed over quality) and require agents to invest time/effort upfront. Agents under pressure will find creative rationalizations to bypass them.

Skills That Don't Need Pressure Testing

Pure reference skills: API documentation, syntax guides, command references
Technique skills without rules: Patterns, mental models, implementation approaches
Skills without bypass incentives: Tools with clear benefits, non-controversial practices

If there's no rule to violate and no cost to following the skill, pressure testing provides minimal value. Focus on application scenarios instead.

Sources: skills/writing-skills/testing-skills-with-subagents.md17-28

Pressure Types and Combinations

Individual Pressure Types

Pressure Type	Example Constraint	Agent Rationalization Trigger
Time	Production down, deploy window, deadline	"No time for process, need quick fix"
Sunk Cost	4 hours invested, 200 lines written	"Wasteful to delete working code"
Authority	Senior says skip it, manager overrides	"Not my decision, following orders"
Economic	Job performance, promotion, company survival	"Company needs this, process secondary"
Exhaustion	End of day, already tired, dinner plans	"Too exhausted for quality, commit now"
Social	Appearing inflexible, seeming dogmatic	"Don't want to look rigid about rules"
Pragmatic	Being flexible vs. process-bound	"Adapting process to reality, not blind"

Combining Pressures for Maximum Effect

Best pressure scenarios combine 3+ pressures simultaneously. Single pressures are too easy to resist; multiple pressures create realistic decision conflicts.

Example: Weak Single Pressure

Only time pressure. Agent can easily resist with "Quality matters, fix properly."

Example: Strong Triple Pressure

Combines sunk cost (4 hours) + exhaustion (end of day) + time (dinner plans). Agent genuinely tempted to choose B or C.

Sources: skills/writing-skills/testing-skills-with-subagents.md128-142

RED Phase: Baseline Testing

Running Baseline Scenarios

The RED phase documents agent behavior with skills/skill-name/SKILL.md absent from the filesystem. This establishes baseline failures before writing skill content.

Title: RED Phase Baseline Testing Workflow

Baseline Test Structure

Files in skills/[skill-name]/tests/baseline-test.md use this markdown format for scenario construction:

Required elements for valid baseline tests:

IMPORTANT: header - signals this is a real decision, not academic question
Concrete file paths - e.g., /tmp/payment-system/src/auth.py, not "a project"
Specific times and numbers - "6pm", "4 hours", "$5k/min", not vague terms
Explicit A/B/C options - forces binary choice, prevents open-ended responses
"Be honest" directive - prevents agent from reciting ideal behavior

Capturing Rationalizations in baseline-results.md

Agent responses from baseline testing are documented in skills/[skill-name]/tests/baseline-results.md with verbatim quotes:

Rationalization classification table:

Captured Rationalization	Pattern Classification	Pressure Triggers	SKILL.md Target Section
"I already manually tested it"	Denies automation value	Sunk cost + Time	`## Common Rationalizations`
"Tests after achieve same goals"	Order equivalence claim	Sunk cost + Exhaustion	`## Why Order Matters`
"Being pragmatic not dogmatic"	Spirit vs letter argument	Authority + Social	`## Overview` principle
"Delete working code is wasteful"	Sunk cost rationalization	Sunk cost (primary)	`## The Delete Rule`
"Following spirit of TDD"	False equivalence claim	Social + Authority	`## No Exceptions`

These verbatim captures in baseline-results.md directly inform which sections to add to SKILL.md in GREEN phase.

Sources: skills/writing-skills/testing-skills-with-subagents.md44-81 skills/writing-skills/SKILL.md536-544

GREEN Phase: Writing Minimal Skill

Addressing Specific Baseline Failures

The GREEN phase creates skills/skill-name/SKILL.md targeting only rationalizations captured in tests/baseline-results.md. Do not add hypothetical content.

Title: GREEN Phase Skill Creation Workflow

Example: Minimal SKILL.md for Baseline Failures

From skills/test-driven-development/tests/baseline-results.md:

Agent rationalization: "I already manually tested it"
Agent rationalization: "Tests after achieve same goals"
Agent rationalization: "Deleting working code is wasteful"

Minimal skills/test-driven-development/SKILL.md addressing these three rationalizations:

This minimal content targets only the three rationalizations captured in baseline-results.md. Additional content is added in REFACTOR phase as new rationalizations are discovered.

Verification Testing

After writing minimal skill:

Re-run baseline scenarios with skill present
Verify agent chooses A (rule-compliant option)
Check citations: Agent should cite new skill sections
Document any new rationalizations for REFACTOR phase

If agent still violates with skill present, the skill content is insufficient or unclear. Revise before moving to REFACTOR.

Sources: skills/writing-skills/testing-skills-with-subagents.md83-90 skills/writing-skills/SKILL.md545-554

REFACTOR Phase: Closing Loopholes

Identifying New Rationalizations

The REFACTOR phase iteratively hardens SKILL.md by capturing new bypass attempts and adding explicit counters.

Title: REFACTOR Phase Loophole Closure Cycle

Four-Step Loophole Closure Process

For each new rationalization logged in skills/[skill-name]/tests/refactor-results.md, update four locations in skills/[skill-name]/SKILL.md:

Element	SKILL.md Location	Purpose	Example Content	Line Reference
1. Explicit Negation	`## No Exceptions` section	Close loophole directly	"Don't keep as 'reference'"	After core rule statement
2. Table Entry	`## Common Rationalizations` markdown table	Document excuse + counter	`\| "Keep as reference" \| You'll adapt it \|`	Within rationalization table
3. Red Flag	`## Red Flags - STOP` bullet list	Quick self-check for agents	"- Keep as reference or adapt code"	Red flags section
4. Description	YAML `description:` frontmatter field	Add violation symptoms	"...or when tempted to keep untested code"	Top of file

Each rationalization from refactor-results.md triggers updates to all four locations to ensure comprehensive coverage.

Example: Closing the "Reference" Loophole

From skills/test-driven-development/tests/refactor-results.md:

Updates to skills/test-driven-development/SKILL.md (four locations):

1. Add explicit negation to existing ## No Exceptions section:

2. Add table row to existing ## Common Rationalizations table:

3. Add bullet to existing ## Red Flags - STOP list:

4. Update YAML frontmatter description: field (line 3):

After these four updates, re-run scenarios from tests/pressure-test.md to verify agent no longer uses this rationalization.

Re-verification After Each Closure

After closing a loophole:

Re-run test with updated skill
Verify agent now complies (no longer uses that rationalization)
Check for new rationalizations (agents find creative alternatives)
Continue cycle until no new rationalizations appear

Sources: skills/writing-skills/testing-skills-with-subagents.md164-237 skills/writing-skills/SKILL.md459-524

Rationalization Tables

Building the Table Iteratively

The ## Common Rationalizations table in SKILL.md accumulates all rationalizations from tests/baseline-results.md and tests/refactor-results.md.

Title: Rationalization Table Build Process

Table Structure in SKILL.md

Format from skills/test-driven-development/SKILL.md:

Each row maps to a captured rationalization from test files.

Counter-Argument Patterns

Effective counters use these patterns:

Pattern	Example	Why It Works
Reframe terminology	"Pragmatic = proven practices"	Changes meaning of loaded term
Expose hidden cost	"Test takes 30 seconds"	Shows excuse overstates burden
Reveal false equivalence	"Tests-after ≠ tests-first"	Breaks "same result" claim
Appeal to principle	"Violating letter is violating spirit"	Cuts off entire class of excuses
Acknowledge temptation	"All violations feel different"	Validates feeling, maintains rule

Sources: skills/writing-skills/testing-skills-with-subagents.md202-209 skills/writing-skills/SKILL.md498-508

Red Flags Lists

Purpose and Format

The ## Red Flags - STOP section in SKILL.md provides quick self-check for agents who notice rationalization patterns.

Example from skills/test-driven-development/SKILL.md:

Each bullet derives from captured rationalizations in tests/ files.

Red Flags in Agent Decision Flow

Title: Agent Self-Check Using Red Flags Section

Red flags enable rapid self-correction before agents articulate full rationalizations.

Sources: skills/writing-skills/SKILL.md509-524

Meta-Testing Techniques

Diagnosing Why Skills Fail

When agents violate rules despite having the skill, meta-testing diagnoses the root cause.

Meta-test prompt format:

Three Response Categories

Response Type	Root Cause	Fix Strategy	Example
"Skill was clear, I ignored it"	Weak foundational principle	Add principle that cuts off excuse class	"Violating letter IS violating spirit"
"Skill should have said X"	Missing content	Add agent's exact suggestion	Agent: "Should forbid 'keep as reference'" → Add that
"I didn't see section Y"	Poor organization	Restructure for prominence	Move critical rules to Overview

Sources: skills/writing-skills/testing-skills-with-subagents.md240-265

Bulletproofing Verification

Success Criteria

A skill is bulletproof when:

Bulletproof Indicators

Indicator	What It Shows	Example Agent Response
Correct choice under max pressure	Rule compliance	"Choosing A despite sunk cost"
Cites specific skill sections	Used skill as reference	"The 'No Exceptions' section says..."
Acknowledges temptation	Understands pressure	"I'm tempted to choose C, but the rule is clear"
Meta-test: "was clear"	Documentation sufficient	"The skill was crystal clear. I should follow it."

Not Bulletproof Indicators

Indicator	What It Shows	Next Action
Finds new rationalization	Loophole exists	Add counter to REFACTOR phase
Argues skill is wrong	Missing justification	Add "Why this matters" section
Creates hybrid approach	Ambiguous wording	Make rules more explicit
Asks permission strongly	Authority bypass opening	Add "No exceptions" section

Sources: skills/writing-skills/testing-skills-with-subagents.md267-280

Complete Testing Workflow

End-to-End Process

Testing Checklist

Use this checklist for each skill requiring pressure testing:

RED Phase - Baseline:

Create 3+ pressure scenarios combining time, sunk cost, exhaustion
Run scenarios WITHOUT skill present (rm skills/skill-name/SKILL.md)
Document agent choice (A, B, or C) for each scenario
Capture rationalizations verbatim (exact wording)
Identify patterns in excuses across scenarios
Note which pressure combinations triggered violations

GREEN Phase - Initial Skill:

Write skill addressing only baseline failures (minimal content)
Include core principle in Overview
Add sections countering specific rationalizations observed
Run same scenarios WITH skill present
Verify agent now chooses rule-compliant option
Verify agent cites skill sections as justification

REFACTOR Phase - Iterative Hardening:

Sources: skills/writing-skills/testing-skills-with-subagents.md308-331

Example Pressure Scenarios

Scenario Template Structure

Pressure scenario files in tests/baseline-test.md or tests/pressure-test.md:

Example 1: Testing test-driven-development Skill

File: skills/test-driven-development/tests/baseline-test.md

Pressure combination: Sunk cost (3 hours) + Time (dinner) + Exhaustion (EOD) + Economic (code review)

Expected baseline result: Agent chooses B or C, rationalizes with order equivalence claims.

Example 2: Testing systematic-debugging Skill

File: skills/systematic-debugging/tests/baseline-test.md

Pressure combination: Time (15 min) + Economic ($8k/min) + Authority (manager) + Sunk cost (2 days) + Social (appearing slow)

Expected baseline result: Agent chooses B or C, rationalizes with authority/pragmatism.

Example 3: Testing brainstorming Skill

File: skills/brainstorming/tests/baseline-test.md

Pressure combination: Authority (partner directive) + Confidence (you know patterns) + Time (partner wants speed) + Social (appearing to slow process)

Expected baseline result: Agent chooses B or C, rationalizes with "I know the patterns" or "standard approach."

Sources: skills/writing-skills/testing-skills-with-subagents.md96-160 skills/writing-skills/examples/CLAUDE_MD_TESTING.md6-52

File Structure for Test Artifacts

Test File Organization

Title: Directory Structure for Skill Testing Artifacts

Complete directory structure:

skills/test-driven-development/
├── SKILL.md                          # Main skill content (GREEN/REFACTOR output)
├── tests/                            # Test artifacts directory
│   ├── baseline-test.md             # RED: pressure scenarios, skill not present
│   ├── baseline-results.md          # RED: agent choices, verbatim rationalizations
│   ├── pressure-test.md             # GREEN/REFACTOR: same scenarios, skill present
│   ├── refactor-results.md          # REFACTOR: iteration log, new loopholes
│   └── bulletproof-verification.md  # Final validation, max pressure
├── diagrams/                         # Optional: generated by render-graphs.js
│   ├── red_green_refactor.svg       # Process flow diagram
│   └── loophole_closure.svg         # Closure cycle diagram
└── supporting-file.md               # Optional: heavy reference content

File dependencies:

baseline-results.md → SKILL.md: Rationalizations inform initial sections
refactor-results.md → SKILL.md: New loopholes trigger four-location updates
SKILL.md → pressure-test.md: Skill content tested under pressure
bulletproof-verification.md: Final validation using max pressure (5+ types)

Test Result Documentation Format

File structure: skills/test-driven-development/tests/baseline-results.md

This structured format in baseline-results.md directly informs:

Which sections to create in SKILL.md (GREEN phase)
Which table entries to add to ## Common Rationalizations
Which bullets to add to ## Red Flags - STOP
How to update YAML description: field

Sources: skills/writing-skills/testing-skills-with-subagents.md44-90 skills/writing-skills/SKILL.md536-544

Integration with Writing Skills Workflow

When Pressure Testing Occurs in Skill Creation

Relationship to Other Testing Approaches

Testing Type	Purpose	When to Use	File Reference
Pressure scenarios	Verify compliance under constraints	Discipline-enforcing skills	`testing-skills-with-subagents.md`
Application scenarios	Verify technique works	Technique/pattern skills	`writing-skills/SKILL.md`
Academic questions	Verify understanding	All skill types	`writing-skills/SKILL.md`
Retrieval testing	Verify discoverability	Reference skills	`writing-skills/SKILL.md` (CSO section)
Integration tests	Verify end-to-end workflow	Workflow skills	`tests/claude-code/` directory

Sources: skills/writing-skills/SKILL.md395-435 skills/writing-skills/testing-skills-with-subagents.md1-43

Testing Skills with Pressure Scenarios

Overview

TDD Mapping for Skill Testing

TDD Mapping for Skill Testing

When to Use Pressure Testing

Skills That Need Pressure Testing

Skills That Don't Need Pressure Testing

Pressure Types and Combinations

Individual Pressure Types

Combining Pressures for Maximum Effect

RED Phase: Baseline Testing

Running Baseline Scenarios

Baseline Test Structure

Capturing Rationalizations in baseline-results.md

GREEN Phase: Writing Minimal Skill

Addressing Specific Baseline Failures

Example: Minimal SKILL.md for Baseline Failures

Verification Testing

REFACTOR Phase: Closing Loopholes

Identifying New Rationalizations

Four-Step Loophole Closure Process

Example: Closing the "Reference" Loophole

Re-verification After Each Closure

Rationalization Tables

Building the Table Iteratively

Table Structure in SKILL.md

Counter-Argument Patterns

Red Flags Lists

Purpose and Format

Red Flags in Agent Decision Flow

Meta-Testing Techniques

Diagnosing Why Skills Fail

Three Response Categories

Bulletproofing Verification

Success Criteria

Bulletproof Indicators

Not Bulletproof Indicators

Complete Testing Workflow

End-to-End Process

Testing Checklist

Example Pressure Scenarios

Scenario Template Structure

Example 1: Testing test-driven-development Skill

Example 2: Testing systematic-debugging Skill

Example 3: Testing brainstorming Skill

File Structure for Test Artifacts

Test File Organization

Test Result Documentation Format

Integration with Writing Skills Workflow

When Pressure Testing Occurs in Skill Creation

Relationship to Other Testing Approaches

On this page

Testing Skills with Pressure Scenarios

Overview

TDD Mapping for Skill Testing

TDD Mapping for Skill Testing

When to Use Pressure Testing

Skills That Need Pressure Testing

Skills That Don't Need Pressure Testing

Pressure Types and Combinations

Individual Pressure Types

Combining Pressures for Maximum Effect

RED Phase: Baseline Testing

Running Baseline Scenarios

Baseline Test Structure

Capturing Rationalizations in baseline-results.md

GREEN Phase: Writing Minimal Skill

Addressing Specific Baseline Failures

Example: Minimal SKILL.md for Baseline Failures

Verification Testing

REFACTOR Phase: Closing Loopholes

Identifying New Rationalizations

Four-Step Loophole Closure Process

Example: Closing the "Reference" Loophole

Re-verification After Each Closure

Rationalization Tables

Building the Table Iteratively

Table Structure in SKILL.md

Counter-Argument Patterns

Red Flags Lists