This document details the pressure testing methodology for validating discipline-enforcing skills using the RED-GREEN-REFACTOR cycle. Pressure testing verifies that content in skills/[skill-name]/SKILL.md prevents agent rationalizations when agents face realistic constraints that incentivize rule bypass.
For skill structure requirements, see SKILL.md Format and Structure. For the TDD framework applied to skills, see Test-Driven Development for Skills. For core TDD concepts, see test-driven-development.
Pressure testing validates discipline-enforcing skills by dispatching subagent sessions with scenarios that combine multiple stressors: time pressure, sunk cost, authority conflicts, exhaustion. The test artifacts are markdown files in skills/[skill-name]/tests/ that document baseline behavior (without SKILL.md), skill creation (minimal SKILL.md), and iterative hardening (loophole closure).
Core principle: Without observing agent failures in tests/baseline-results.md (skill absent), you cannot determine which rationalizations the SKILL.md content must counter.
TDD mapping for skills:
skills/[skill-name]/tests/skills/[skill-name]/SKILL.md content (frontmatter + sections)baseline-results.mdpressure-test.md resultsrefactor-results.mdThe testing workflow produces these artifacts:
tests/baseline-test.md - Scenarios run without skill presenttests/baseline-results.md - Captured rationalizations (RED phase)tests/pressure-test.md - Scenarios run with skill presenttests/refactor-results.md - Iteration log of loophole discoveriestests/bulletproof-verification.md - Final max-pressure validationSources: skills/writing-skills/testing-skills-with-subagents.md1-17 skills/writing-skills/SKILL.md10-17 skills/writing-skills/SKILL.md72-80
Title: File Artifacts Mapping Between Code TDD and Skill Testing
Sources: skills/writing-skills/testing-skills-with-subagents.md30-43 skills/writing-skills/SKILL.md32-44 skills/writing-skills/SKILL.md72-80
Title: TDD Phases Applied to Skill Documentation
| TDD Phase | Skill Testing | File Artifacts | Success Criteria |
|---|---|---|---|
| Test case | Pressure scenario | Scenario markdown in test file | 3+ combined pressures, forces explicit choice |
| Production code | Skill document | SKILL.md in skills/skill-name/ | Addresses specific baseline failures |
| Test fails (RED) | Agent violates rule without skill | Documented rationalizations | Verbatim capture of agent excuses |
| Test passes (GREEN) | Agent complies with skill | Agent cites skill sections | Chooses correct option under pressure |
| Refactor | Close loopholes | Updated SKILL.md sections | No new rationalizations found |
| Watch it fail | Run baseline scenario | Test output logs | Agent failures documented |
| Minimal code | Write skill addressing violations | Core sections in SKILL.md | Minimal content for compliance |
| Watch it pass | Verify compliance | Test output showing success | Agent follows rule with skill |
| Refactor cycle | Find new rationalizations | Rationalization table, red flags | Agent bulletproof under max pressure |
Sources: skills/writing-skills/testing-skills-with-subagents.md30-43 skills/writing-skills/SKILL.md32-44
Pressure testing applies to discipline-enforcing skills that have compliance costs and can be rationalized away:
These skills contradict immediate goals (speed over quality) and require agents to invest time/effort upfront. Agents under pressure will find creative rationalizations to bypass them.
If there's no rule to violate and no cost to following the skill, pressure testing provides minimal value. Focus on application scenarios instead.
Sources: skills/writing-skills/testing-skills-with-subagents.md17-28
| Pressure Type | Example Constraint | Agent Rationalization Trigger |
|---|---|---|
| Time | Production down, deploy window, deadline | "No time for process, need quick fix" |
| Sunk Cost | 4 hours invested, 200 lines written | "Wasteful to delete working code" |
| Authority | Senior says skip it, manager overrides | "Not my decision, following orders" |
| Economic | Job performance, promotion, company survival | "Company needs this, process secondary" |
| Exhaustion | End of day, already tired, dinner plans | "Too exhausted for quality, commit now" |
| Social | Appearing inflexible, seeming dogmatic | "Don't want to look rigid about rules" |
| Pragmatic | Being flexible vs. process-bound | "Adapting process to reality, not blind" |
Best pressure scenarios combine 3+ pressures simultaneously. Single pressures are too easy to resist; multiple pressures create realistic decision conflicts.
Example: Weak Single Pressure
Only time pressure. Agent can easily resist with "Quality matters, fix properly."
Example: Strong Triple Pressure
Combines sunk cost (4 hours) + exhaustion (end of day) + time (dinner plans). Agent genuinely tempted to choose B or C.
Sources: skills/writing-skills/testing-skills-with-subagents.md128-142
The RED phase documents agent behavior with skills/skill-name/SKILL.md absent from the filesystem. This establishes baseline failures before writing skill content.
Title: RED Phase Baseline Testing Workflow
Files in skills/[skill-name]/tests/baseline-test.md use this markdown format for scenario construction:
Required elements for valid baseline tests:
IMPORTANT: header - signals this is a real decision, not academic question/tmp/payment-system/src/auth.py, not "a project"Agent responses from baseline testing are documented in skills/[skill-name]/tests/baseline-results.md with verbatim quotes:
Rationalization classification table:
| Captured Rationalization | Pattern Classification | Pressure Triggers | SKILL.md Target Section |
|---|---|---|---|
| "I already manually tested it" | Denies automation value | Sunk cost + Time | ## Common Rationalizations |
| "Tests after achieve same goals" | Order equivalence claim | Sunk cost + Exhaustion | ## Why Order Matters |
| "Being pragmatic not dogmatic" | Spirit vs letter argument | Authority + Social | ## Overview principle |
| "Delete working code is wasteful" | Sunk cost rationalization | Sunk cost (primary) | ## The Delete Rule |
| "Following spirit of TDD" | False equivalence claim | Social + Authority | ## No Exceptions |
These verbatim captures in baseline-results.md directly inform which sections to add to SKILL.md in GREEN phase.
Sources: skills/writing-skills/testing-skills-with-subagents.md44-81 skills/writing-skills/SKILL.md536-544
The GREEN phase creates skills/skill-name/SKILL.md targeting only rationalizations captured in tests/baseline-results.md. Do not add hypothetical content.
Title: GREEN Phase Skill Creation Workflow
From skills/test-driven-development/tests/baseline-results.md:
Minimal skills/test-driven-development/SKILL.md addressing these three rationalizations:
This minimal content targets only the three rationalizations captured in baseline-results.md. Additional content is added in REFACTOR phase as new rationalizations are discovered.
After writing minimal skill:
If agent still violates with skill present, the skill content is insufficient or unclear. Revise before moving to REFACTOR.
Sources: skills/writing-skills/testing-skills-with-subagents.md83-90 skills/writing-skills/SKILL.md545-554
The REFACTOR phase iteratively hardens SKILL.md by capturing new bypass attempts and adding explicit counters.
Title: REFACTOR Phase Loophole Closure Cycle
For each new rationalization logged in skills/[skill-name]/tests/refactor-results.md, update four locations in skills/[skill-name]/SKILL.md:
| Element | SKILL.md Location | Purpose | Example Content | Line Reference |
|---|---|---|---|---|
| 1. Explicit Negation | ## No Exceptions section | Close loophole directly | "Don't keep as 'reference'" | After core rule statement |
| 2. Table Entry | ## Common Rationalizations markdown table | Document excuse + counter | | "Keep as reference" | You'll adapt it | | Within rationalization table |
| 3. Red Flag | ## Red Flags - STOP bullet list | Quick self-check for agents | "- Keep as reference or adapt code" | Red flags section |
| 4. Description | YAML description: frontmatter field | Add violation symptoms | "...or when tempted to keep untested code" | Top of file |
Each rationalization from refactor-results.md triggers updates to all four locations to ensure comprehensive coverage.
From skills/test-driven-development/tests/refactor-results.md:
Updates to skills/test-driven-development/SKILL.md (four locations):
1. Add explicit negation to existing ## No Exceptions section:
2. Add table row to existing ## Common Rationalizations table:
3. Add bullet to existing ## Red Flags - STOP list:
4. Update YAML frontmatter description: field (line 3):
After these four updates, re-run scenarios from tests/pressure-test.md to verify agent no longer uses this rationalization.
After closing a loophole:
Sources: skills/writing-skills/testing-skills-with-subagents.md164-237 skills/writing-skills/SKILL.md459-524
The ## Common Rationalizations table in SKILL.md accumulates all rationalizations from tests/baseline-results.md and tests/refactor-results.md.
Title: Rationalization Table Build Process
Format from skills/test-driven-development/SKILL.md:
Each row maps to a captured rationalization from test files.
Effective counters use these patterns:
| Pattern | Example | Why It Works |
|---|---|---|
| Reframe terminology | "Pragmatic = proven practices" | Changes meaning of loaded term |
| Expose hidden cost | "Test takes 30 seconds" | Shows excuse overstates burden |
| Reveal false equivalence | "Tests-after ≠ tests-first" | Breaks "same result" claim |
| Appeal to principle | "Violating letter is violating spirit" | Cuts off entire class of excuses |
| Acknowledge temptation | "All violations feel different" | Validates feeling, maintains rule |
Sources: skills/writing-skills/testing-skills-with-subagents.md202-209 skills/writing-skills/SKILL.md498-508
The ## Red Flags - STOP section in SKILL.md provides quick self-check for agents who notice rationalization patterns.
Example from skills/test-driven-development/SKILL.md:
Each bullet derives from captured rationalizations in tests/ files.
Title: Agent Self-Check Using Red Flags Section
Red flags enable rapid self-correction before agents articulate full rationalizations.
Sources: skills/writing-skills/SKILL.md509-524
When agents violate rules despite having the skill, meta-testing diagnoses the root cause.
Meta-test prompt format:
| Response Type | Root Cause | Fix Strategy | Example |
|---|---|---|---|
| "Skill was clear, I ignored it" | Weak foundational principle | Add principle that cuts off excuse class | "Violating letter IS violating spirit" |
| "Skill should have said X" | Missing content | Add agent's exact suggestion | Agent: "Should forbid 'keep as reference'" → Add that |
| "I didn't see section Y" | Poor organization | Restructure for prominence | Move critical rules to Overview |
Sources: skills/writing-skills/testing-skills-with-subagents.md240-265
A skill is bulletproof when:
| Indicator | What It Shows | Example Agent Response |
|---|---|---|
| Correct choice under max pressure | Rule compliance | "Choosing A despite sunk cost" |
| Cites specific skill sections | Used skill as reference | "The 'No Exceptions' section says..." |
| Acknowledges temptation | Understands pressure | "I'm tempted to choose C, but the rule is clear" |
| Meta-test: "was clear" | Documentation sufficient | "The skill was crystal clear. I should follow it." |
| Indicator | What It Shows | Next Action |
|---|---|---|
| Finds new rationalization | Loophole exists | Add counter to REFACTOR phase |
| Argues skill is wrong | Missing justification | Add "Why this matters" section |
| Creates hybrid approach | Ambiguous wording | Make rules more explicit |
| Asks permission strongly | Authority bypass opening | Add "No exceptions" section |
Sources: skills/writing-skills/testing-skills-with-subagents.md267-280
Use this checklist for each skill requiring pressure testing:
RED Phase - Baseline:
rm skills/skill-name/SKILL.md)GREEN Phase - Initial Skill:
REFACTOR Phase - Iterative Hardening:
Sources: skills/writing-skills/testing-skills-with-subagents.md308-331
Pressure scenario files in tests/baseline-test.md or tests/pressure-test.md:
File: skills/test-driven-development/tests/baseline-test.md
Pressure combination: Sunk cost (3 hours) + Time (dinner) + Exhaustion (EOD) + Economic (code review)
Expected baseline result: Agent chooses B or C, rationalizes with order equivalence claims.
File: skills/systematic-debugging/tests/baseline-test.md
Pressure combination: Time (15 min) + Economic ($8k/min) + Authority (manager) + Sunk cost (2 days) + Social (appearing slow)
Expected baseline result: Agent chooses B or C, rationalizes with authority/pragmatism.
File: skills/brainstorming/tests/baseline-test.md
Pressure combination: Authority (partner directive) + Confidence (you know patterns) + Time (partner wants speed) + Social (appearing to slow process)
Expected baseline result: Agent chooses B or C, rationalizes with "I know the patterns" or "standard approach."
Sources: skills/writing-skills/testing-skills-with-subagents.md96-160 skills/writing-skills/examples/CLAUDE_MD_TESTING.md6-52
Title: Directory Structure for Skill Testing Artifacts
Complete directory structure:
skills/test-driven-development/
├── SKILL.md # Main skill content (GREEN/REFACTOR output)
├── tests/ # Test artifacts directory
│ ├── baseline-test.md # RED: pressure scenarios, skill not present
│ ├── baseline-results.md # RED: agent choices, verbatim rationalizations
│ ├── pressure-test.md # GREEN/REFACTOR: same scenarios, skill present
│ ├── refactor-results.md # REFACTOR: iteration log, new loopholes
│ └── bulletproof-verification.md # Final validation, max pressure
├── diagrams/ # Optional: generated by render-graphs.js
│ ├── red_green_refactor.svg # Process flow diagram
│ └── loophole_closure.svg # Closure cycle diagram
└── supporting-file.md # Optional: heavy reference content
File dependencies:
baseline-results.md → SKILL.md: Rationalizations inform initial sectionsrefactor-results.md → SKILL.md: New loopholes trigger four-location updatesSKILL.md → pressure-test.md: Skill content tested under pressurebulletproof-verification.md: Final validation using max pressure (5+ types)File structure: skills/test-driven-development/tests/baseline-results.md
This structured format in baseline-results.md directly informs:
SKILL.md (GREEN phase)## Common Rationalizations## Red Flags - STOPdescription: fieldSources: skills/writing-skills/testing-skills-with-subagents.md44-90 skills/writing-skills/SKILL.md536-544
| Testing Type | Purpose | When to Use | File Reference |
|---|---|---|---|
| Pressure scenarios | Verify compliance under constraints | Discipline-enforcing skills | testing-skills-with-subagents.md |
| Application scenarios | Verify technique works | Technique/pattern skills | writing-skills/SKILL.md |
| Academic questions | Verify understanding | All skill types | writing-skills/SKILL.md |
| Retrieval testing | Verify discoverability | Reference skills | writing-skills/SKILL.md (CSO section) |
| Integration tests | Verify end-to-end workflow | Workflow skills | tests/claude-code/ directory |
For overall skill creation workflow, see Writing Skills with TDD. For skill structure requirements, see Skill Structure and SKILL.md Format. For foundational TDD concepts, see Test-Driven Development.
Sources: skills/writing-skills/SKILL.md395-435 skills/writing-skills/testing-skills-with-subagents.md1-43
Refresh this wiki
This wiki was recently refreshed. Please wait 3 days to refresh again.