How I Built an AI-Powered QA Agent That Generates Test Plans From Pull Requests Using Claude Code
Manual QA is one of the biggest bottlenecks in software development. Every pull request needs test cases, every feature needs regression coverage, and every bug fix needs verification steps. I built an automated system that reads your PR diff, understands what changed at the code level, and generates a comprehensive, white-box test plan — all triggered by a single slash command inside Claude Code.
In this guide, I’ll walk you through the exact architecture, every file involved, and how you can replicate this for your own project.
Table of Contents
- The Problem: Manual QA Doesn’t Scale
- Architecture Overview
- Component 1: The PR Analysis Shell Script
- Component 2: The Slash Command (Orchestrator)
- Component 3: The QA Agent (Test Plan Generator)
- Component 4: The QA Reference Playbook
- How It All Works Together
- The Output: What You Get
- Step-by-Step Implementation Guide
- Key Design Decisions and Why
- Customizing for Your Project
- Conclusion
The Problem: Manual QA Doesn’t Scale
If you’re working on a project with frequent PRs, you’ve probably experienced this:
- Developers write PRs with vague test instructions like “test that it works”
- QA engineers spend hours reading diffs to understand what actually changed
- Edge cases get missed because nobody traced every if/else branch in the code
- Security vulnerabilities (missing nonce checks, unsanitized input) slip through
- Regression areas are overlooked because the impact wasn’t mapped
I wanted to solve this by building a system where I type one command, and an AI agent reads the actual code diff, performs white-box analysis, and generates a structured, prioritized test plan that a QA tester can execute immediately.
Architecture Overview
The system has four components that work in a pipeline:
┌─────────────────────────────────────────────────────────────────┐
│ /qa-pr 1234 │
│ (Slash Command) │
│ │
│ 1. Runs analyze-pr.sh ──► Generates qa-analysis.txt │
│ 2. Fetches linked GitHub issues for context │
│ 3. Invokes qa-agent agent with the analysis │
│ 4. Agent reads QA reference docs + diff │
│ 5. Generates structured test plan │
│ 6. Optionally posts test plan to PR description │
└─────────────────────────────────────────────────────────────────┘
File structure:
.claude/
├── commands/
│ └── qa-pr.md # Slash command (orchestrator)
├── agents/
│ └── qa-agent.md # QA agent definition
├── scripts/
│ └── analyze-pr.sh # PR diff analysis script
└── docs/
├── qa-playbook.md # QA reference playbook
├── rest-api.md # REST API endpoints & contracts
├── database-schema.md # Database tables & relationships
├── architecture.md # System architecture overview
└── test-environments.md # Environment configs & setup
Let’s break down each component.
Component 1: The PR Analysis Shell Script
File: .claude/scripts/analyze-pr.sh
This is the foundation — a bash script that extracts everything the AI agent needs from a PR. It uses the GitHub CLI (gh) and git diff to produce a structured analysis file.
What It Does
The script accepts a PR number (or two branch refs) and generates a comprehensive analysis covering:
- PR metadata — title, author, description, labels, review instructions
- Commit history — categorized by type (bug fix, feature, refactor, test, docs)
- Changed files summary — categorized into Frontend, Backend, API, Database, Test, Docs, Config
- Code statistics — lines added/removed per file type
- Feature detection — new classes, functions, REST endpoints, database operations, form changes
- Domain-specific change detection — SEO meta, schema, sitemaps, social meta, redirects, settings, etc.
- Testing requirements analysis — flags which test types are needed (unit, integration, E2E, API, visual, performance, accessibility)
- Framework-specific patterns — routes, middleware, hooks, cron jobs, caching
- User roles & permissions — capability checks that need role-based testing
- Data flow analysis — input sources (superglobals) and output points that need escaping
- Full file-by-file diff — the actual code changes with configurable context lines
- Complexity score — a calculated score that estimates testing effort
Key Implementation Details
Cached git operations for performance:
# Cache all git diff operations at the start to avoid repeated git calls
CHANGED_FILES=$(git diff --name-only "$BASE_REF...$HEAD_REF")
PHP_FILES_LIST=$(echo "$CHANGED_FILES" | grep '\.php$' || true)
JS_FILES_LIST=$(echo "$CHANGED_FILES" | grep -E '\.(js|vue|jsx|tsx|ts)$' || true)
if [ -n "$PHP_FILES_LIST" ]; then
PHP_DIFF=$(echo "$PHP_FILES_LIST" | xargs -I {} git diff "$BASE_REF...$HEAD_REF" -- {} 2>/dev/null || true)
fi
The script caches the full diff at startup so all subsequent pattern matching operates on in-memory strings rather than re-running git diff for each check. This is important because the script runs 30+ pattern detections.
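As a simplified sketch of that idea (with a stand-in diff string and illustrative patterns, not the script's real regex set), each detection becomes a grep over the cached string:

```shell
# Sketch of pattern detection against a cached diff. DIFF stands in for the
# PHP_DIFF variable the script caches at startup; patterns are illustrative.
DIFF='+function save_meta( $post_id ) {
+register_rest_route( "myplugin/v1", "/meta" );
+$wpdb->query( "DELETE FROM {$wpdb->prefix}meta" );'

# Every check runs against the in-memory string, never a fresh git diff.
NEW_FUNCTIONS_COUNT=$(echo "$DIFF" | grep -c '^+function ' || true)
REST_ENDPOINTS_COUNT=$(echo "$DIFF" | grep -c 'register_rest_route' || true)
DB_OPERATIONS_COUNT=$(echo "$DIFF" | grep -cE '\$wpdb->(query|insert|update|delete)' || true)

echo "functions=$NEW_FUNCTIONS_COUNT endpoints=$REST_ENDPOINTS_COUNT db=$DB_OPERATIONS_COUNT"
```

With 30+ of these checks, the one-time cache turns 30+ subprocess spawns into 30+ string scans.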
Single-pass file categorization with awk:
Instead of running multiple grep commands over the file list, the script uses a single awk pass to categorize all files:
echo "$CHANGED_FILES" | awk '
BEGIN { frontend_count=0; backend_count=0; api_count=0; ... }
{
file = $0
if (file ~ /\.vue$/) { frontend[++frontend_count] = file }
else if (file ~ /\.php$/) {
backend[++backend_count] = file
if (file ~ /(api|rest|endpoint|ajax)/) api[++api_count] = file
}
...
}
END { ... }'
Smart commit categorization:
git log --oneline --no-merges "$BASE_REF..$HEAD_REF" | while read -r commit_line; do
  commit_msg="${commit_line#* }"  # strip the short hash, keep the message
  if echo "$commit_msg" | grep -iE '(fix|bug|issue|resolve|close)' > /dev/null; then
    echo "BUG FIX: $commit_line"
  elif echo "$commit_msg" | grep -iE '(feat|feature|add|new)' > /dev/null; then
    echo "NEW FEATURE: $commit_line"
  fi
done
Complexity scoring:
COMPLEXITY_SCORE=0
COMPLEXITY_SCORE=$((COMPLEXITY_SCORE + NEW_FUNCTIONS_COUNT))
COMPLEXITY_SCORE=$((COMPLEXITY_SCORE + NEW_CLASSES_COUNT * 2))
COMPLEXITY_SCORE=$((COMPLEXITY_SCORE + REST_ENDPOINTS_COUNT * 3))
COMPLEXITY_SCORE=$((COMPLEXITY_SCORE + DB_OPERATIONS_COUNT * 2))
Each type of change is weighted differently. REST endpoints get 3x because they need auth, validation, and response testing. Database operations get 2x because they need data integrity testing.
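The score is most useful once it maps to a label a tester can act on. Here is a hedged sketch of that mapping; the thresholds (10 and 25) and the sample counts are illustrative, not taken from the original script:

```shell
# Illustrative inputs: what the detection phase might have counted for one PR.
NEW_FUNCTIONS_COUNT=4; NEW_CLASSES_COUNT=1; REST_ENDPOINTS_COUNT=2; DB_OPERATIONS_COUNT=3

# Weighted sum, matching the weights described above.
COMPLEXITY_SCORE=0
COMPLEXITY_SCORE=$((COMPLEXITY_SCORE + NEW_FUNCTIONS_COUNT))
COMPLEXITY_SCORE=$((COMPLEXITY_SCORE + NEW_CLASSES_COUNT * 2))
COMPLEXITY_SCORE=$((COMPLEXITY_SCORE + REST_ENDPOINTS_COUNT * 3))
COMPLEXITY_SCORE=$((COMPLEXITY_SCORE + DB_OPERATIONS_COUNT * 2))

# Hypothetical thresholds for turning the raw score into a risk label.
if   [ "$COMPLEXITY_SCORE" -ge 25 ]; then RISK="High"
elif [ "$COMPLEXITY_SCORE" -ge 10 ]; then RISK="Medium"
else RISK="Low"; fi

echo "Complexity: $COMPLEXITY_SCORE ($RISK)"
```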
Usage
# Analyze a PR by number
./analyze-pr.sh 1234 --output-file=qa-analysis.txt
# Analyze local branches
./analyze-pr.sh main feature-branch
# Compact mode for small PRs
./analyze-pr.sh 1234 --compact
Component 2: The Slash Command (Orchestrator)
File: .claude/commands/qa-pr.md
This is the entry point — a Claude Code custom slash command that orchestrates the entire pipeline. When you type /qa-pr 1234 in Claude Code, this file defines what happens.
The Orchestration Flow
---
description: Run QA analysis on a PR and generate a comprehensive test plan
argument-hint: <pr-number> [max]
allowed-tools: Bash, Read, Write, Task
---
Step 1: Parse arguments and run the analysis script
The command accepts a PR number and an optional max flag. It runs the shell script to generate the analysis file:
cd "$(git rev-parse --show-toplevel)"
.claude/scripts/analyze-pr.sh PR_NUM --output-file=qa-analysis.txt
Step 2: Read the generated analysis
Claude Code reads the qa-analysis.txt file to understand the full context of the PR.
Step 3: Fetch linked GitHub issues for deeper context
gh pr view PR_NUM --json body -q '.body' | grep -oE '(Fixes|Closes|Resolves|Related to) #[0-9]+' | grep -oE '[0-9]+'
For each linked issue, it fetches the full issue context:
gh issue view {ISSUE_NUMBER} --json title,body,comments
This is a critical design decision: linked issues are used to enrich test cases, not dumped into the output. Bug reports from issues get synthesized into better test cases with specific reproduction steps.
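The extract-then-fetch loop can be sketched as follows. PR_BODY stands in for the gh pr view output, and the gh issue view call is echoed rather than executed so the sketch stays self-contained:

```shell
# PR_BODY stands in for: gh pr view PR_NUM --json body -q '.body'
PR_BODY='This PR improves meta handling.

Fixes #456
Related to #789'

# Pull out linked issue numbers using the standard GitHub linking keywords.
ISSUE_NUMS=$(echo "$PR_BODY" \
  | grep -oE '(Fixes|Closes|Resolves|Related to) #[0-9]+' \
  | grep -oE '[0-9]+')

for n in $ISSUE_NUMS; do
  # In the real command this runs: gh issue view "$n" --json title,body,comments
  echo "would fetch issue #$n"
done
```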
Step 4: Invoke the QA agent
The command passes the analysis content, linked issue context, and QA mode (default or max) to the qa-agent subagent.
Step 5: Post to PR description (with user confirmation)
After generation, the command asks if you want to add the test plan to the PR. If yes, it:
- Wraps the plan in a collapsible <details> block
- Inserts it before the “Beta Builds” section (configurable)
- Replaces any existing test plan (idempotent)
- Updates the PR via gh pr edit
The insertion logic handles edge cases:
# Split PR body at target section — test plan goes right before it
if grep -q '^### Beta Builds' /tmp/qa-pr-existing-body.md; then
csplit -f /tmp/qa-split- -s /tmp/qa-pr-existing-body.md '/^### Beta Builds$/'
fi
# If an existing QA Test Plan block is present, strip it
if grep -q 'QA Test Plan' "$BEFORE"; then
sed '/<details>/{N;/QA Test Plan/,/<\/details>/d}' "$BEFORE" > /tmp/qa-before-clean.md
fi
Two Modes
- /qa-pr 1234 — Default mode: Test Data, Test Cases, Regression Areas, Verification Commands
- /qa-pr 1234 max — Full mode: adds Quick Test Script (curl commands for API PRs) and Additional Test Considerations (boundary cases, integration tests, security tests)
Component 3: The QA Agent (Test Plan Generator)
File: .claude/agents/qa-agent.md
This is the brain of the system — a Claude Code custom agent that acts as an elite QA engineer. It receives the analysis output and generates structured, prioritized test cases.
Agent Definition
---
name: qa-agent
description: QA Engineer for comprehensive test case generation
tools: Read, Write, Bash
model: sonnet
color: green
---
Key choices here:
- model: sonnet — Uses Sonnet for cost efficiency. The agent generates long, structured output, so using a faster/cheaper model makes sense while the orchestrator (which needs judgment) runs on the default model.
- tools: Read, Write, Bash — The agent can read reference docs, write the test plan, and run bash commands if needed.
White-Box Testing Approach
This is what makes the system fundamentally different from generic QA. The agent doesn’t just test the feature described in the PR — it reads the actual code diff and derives test cases from the implementation:
Branch coverage: For every if/else, switch/case, and ternary in the diff, it creates test cases that exercise each branch.
Boundary conditions: If code checks if ( $count > 10 ), it generates tests with values 9, 10, and 11.
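The boundary rule is mechanical enough to sketch as a tiny helper. This is hypothetical, purely to illustrate the pattern the agent applies; the agent itself does this in reasoning, not in shell:

```shell
# Hypothetical helper: given the limit from a check like `if ( $count > 10 )`,
# emit the three values a tester should exercise (below, at, above the boundary).
boundary_values() {
  local limit="$1"
  echo "$((limit - 1)) $limit $((limit + 1))"
}

boundary_values 10   # prints: 9 10 11
```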
Error paths: It finds try/catch blocks, wp_die(), return new WP_Error(), and early return guards, then creates test cases that trigger each error path.
Security analysis:
- Missing current_user_can() checks → flagged as P0 security issue
- Missing wp_verify_nonce() → flagged with test case
- Raw SQL without $wpdb->prepare() → SQL injection test cases
- User input without sanitization → XSS test cases
Data type assumptions: If code uses intval() but input could be an array, or uses sanitize_text_field() on HTML content, it creates type mismatch test cases.
Null/empty handling: If code accesses $data['key'] without existence checks, it generates tests with missing/empty data.
The 5-Phase Analysis Methodology
Phase 1: Context Understanding
- What changed? (feature, bug fix, refactor)
- Why? (problem statement from PR description + linked issues)
- What’s the expected outcome?
- What did the author think was important to test?
Phase 2: Test Requirement Identification
For each relevant area, it identifies what needs testing across 8 categories:
- Functional (happy path, variations, error messages)
- Integration (core platform, third-party conflicts, REST API)
- UI/UX (responsive, browser compat, visual states)
- Data (CRUD, validation, edge cases)
- API (auth, validation, response format)
- Security (XSS, SQLi, CSRF, capability checks)
- Performance (load times, query efficiency, large data)
- Accessibility (keyboard nav, screen reader, contrast)
Phase 3: Test Case Generation
Each test case follows a strict format:
TEST CASE ID: TC-[MODULE]-[NUMBER]
MODULE: [Feature Name]
TYPE: [Functional/Integration/UI/Security/Performance/Accessibility]
PRIORITY: [P0-Critical/P1-High/P2-Medium/P3-Low]
OBJECTIVE:
[One sentence describing what this test verifies]
PRECONDITIONS:
- [Required setup, user role, data state]
TEST STEPS:
1. [Specific action]
2. [Specific action]
EXPECTED RESULT:
- [Specific, verifiable outcome]
Smart merging rules: Related scenarios targeting the same function or UI element get merged into a single test case. Instead of 4 separate cases for “empty title”, “null title”, “special chars in title”, “XSS payload in title” — you get one “Title field input validation” case with multiple verification steps.
Phase 4: Output Generation
The agent uses mode-aware templates:
- Default mode: 4 sections (Test Data, Test Cases, Regression Areas, Verification Commands)
- Max mode: 7 sections (adds Quick Test Script, Additional Test Considerations)
Phase 5: Quality Checklist
Before finalizing, the agent verifies:
- All changed features covered
- Error conditions and boundary cases included
- Integration points covered
- Access control tests present
- Steps are specific and executable
- Expected results are verifiable
Reference Document System
The agent reads domain-specific reference docs based on what the PR touches:
| PR Touches | Agent Reads |
|---|---|
| Security, payloads, general QA | qa-playbook.md |
| REST API endpoints | rest-api.md |
| Database, migrations | database-schema.md |
| Meta titles, tag variables | smart-tags.md |
| Settings pages | settings-map.md |
| Cross-feature impact | feature-interactions.md |
This gives the agent domain knowledge without stuffing it all into the system prompt. The baseline playbook (qa-playbook.md) is always read; additional docs are loaded conditionally.
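A minimal sketch of the selection rule, assuming the doc names from the table above and simple path-based heuristics (the real decision is made by the agent from the analysis file, not by a script):

```shell
# CHANGED_FILES stands in for the file list from qa-analysis.txt.
CHANGED_FILES='includes/rest/class-meta-endpoint.php
includes/db/migrations/004-add-index.php'

DOCS="qa-playbook.md"   # the baseline playbook is always read

# Conditional docs, loaded only when the PR touches the matching area.
echo "$CHANGED_FILES" | grep -qE '(api|rest|endpoint)' && DOCS="$DOCS rest-api.md"
echo "$CHANGED_FILES" | grep -qE '(db|database|migration|schema)' && DOCS="$DOCS database-schema.md"

echo "Agent reads: $DOCS"
```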
Component 4: The QA Reference Playbook
File: .claude/docs/qa-playbook.md (your project’s equivalent)
This is the agent’s domain knowledge — a curated reference that teaches it how to test your specific application. It includes:
Security Testing Payloads
Pre-built payloads organized by attack type:
XSS payloads — basic, event handlers, encoded, and platform-specific (shortcode context, block editor context):
<script>alert('XSS')</script>
" onmouseover="alert('XSS')
%3Cscript%3Ealert(1)%3C/script%3E
[shortcode attr="<script>alert(1)</script>"]
SQL injection payloads — basic, union-based, time-based:
' OR '1'='1
' UNION SELECT user_login,user_pass,3 FROM wp_users--
' AND SLEEP(5)--
CSRF testing commands:
# Test without nonce
curl -X POST "https://site.com/wp-admin/admin-ajax.php" \
-d "action=your_action&data=test" \
-H "Cookie: session_token=..."
Verification Commands
Ready-to-use commands for verifying different aspects of your application:
# Check meta tags
curl -s "https://site.com/page/" | grep -oP 'name="description" content="\K[^"]+'
# Extract JSON-LD schema
curl -s "https://site.com/page/" | \
grep -oP '<script type="application/ld\+json">\K[^<]+' | \
python3 -m json.tool
# Test redirects
curl -sIL "https://site.com/old-url/" 2>/dev/null | grep -E "HTTP|Location"
Test Case Templates
Pre-built test case templates for each feature area (SEO meta, schema, sitemaps, social meta, redirects, etc.) that the agent uses as starting points and customizes based on the actual diff.
Priority Classification
| Priority | When to Use | Response |
|---|---|---|
| P0 | Security issue, data loss, complete failure | Block release |
| P1 | Core feature broken, no workaround | Fix this sprint |
| P2 | Feature works with issues, workaround exists | Fix next sprint |
| P3 | Cosmetic, minor inconvenience | Backlog |
Environment Matrix
Testing environment requirements including runtime versions, database versions, and common test scenarios (fresh install vs upgrade, single instance vs multi-tenant, etc.).
How It All Works Together
Here’s the complete flow when you run /qa-pr 1234:
User types: /qa-pr 1234
│
▼
┌──────────────────────────────┐
│ qa-pr.md (Slash Command) │
│ Parses args: PR=1234 │
│ QA_MODE=default │
└──────────┬───────────────────┘
│
▼
┌──────────────────────────────┐
│ analyze-pr.sh 1234 │
│ --output-file=qa-analysis │
│ │
│ ► gh pr view (metadata) │
│ ► git diff (code changes) │
│ ► Pattern detection (30+ │
│ checks for features, │
│ security, DB ops, etc.) │
│ ► Complexity scoring │
│ │
│ Output: qa-analysis.txt │
└──────────┬───────────────────┘
│
▼
┌──────────────────────────────┐
│ Fetch linked issues │
│ gh issue view #456 │
│ (bug reports, context) │
└──────────┬───────────────────┘
│
▼
┌──────────────────────────────┐
│ qa-agent Agent │
│ (Claude Sonnet) │
│ │
│ Inputs: │
│ ► qa-analysis.txt │
│ ► Linked issue context │
│ ► QA_MODE=default │
│ │
│ Reads reference docs: │
│ ► qa-playbook.md (always) │
│ ► rest-api.md (if API PR) │
│ ► database-schema.md (if │
│ DB changes) │
│ │
│ 5-phase analysis: │
│ 1. Context understanding │
│ 2. Test requirements │
│ 3. Test case generation │
│ 4. Output formatting │
│ 5. Quality checklist │
│ │
│ Output: Structured test │
│ plan with P0-P3 cases │
└──────────┬───────────────────┘
│
▼
┌──────────────────────────────┐
│ Ask user: Post to PR? │
│ │
│ If yes: │
│ ► Wrap in <details> block │
│ ► Insert before target │
│ section in PR body │
│ ► gh pr edit --body-file │
└──────────────────────────────┘
The Output: What You Get
Default Mode Output Structure
# QA Test Plan
> **Change Type:** Bug Fix | **Risk Level:** High | **Files Changed:** 5 | **Test Cases:** 12 | **Required Plugins:** WooCommerce
---
<details>
<summary><strong>Test Data</strong></summary>
### Input Values to Test
| Value | Purpose |
|-------|---------|
| Empty string "" | Tests null/empty handling in save function |
| String > 160 chars | Tests character limit boundary |
| `<script>alert(1)</script>` | Tests XSS sanitization |
</details>
---
<details>
<summary><strong>Test Cases</strong></summary>
### P0 - Critical (Must Test)
<details>
<summary><strong>TC-META-001: SEO Title saves and displays on frontend</strong></summary>
TEST CASE ID: TC-META-001
MODULE: SEO Meta
TYPE: Functional
PRIORITY: P0-Critical
OBJECTIVE:
Verify that custom SEO title saves in the editor and renders correctly in the page source.
PRECONDITIONS:
- Admin panel access
- Plugin activated
- At least one published post
TEST STEPS:
1. Navigate to Posts > Edit any published post
2. Open the SEO Settings panel
3. Enter "Custom Test Title | My Site" in the SEO Title field
4. Click Update
5. View the post on the frontend
6. View page source (Ctrl+U)
EXPECTED RESULT:
- The <title> tag contains "Custom Test Title | My Site"
- The og:title meta tag matches
- No raw variable placeholders visible
</details>
### P1 - High Priority
[More test cases...]
### P2 - Medium Priority
[More test cases...]
</details>
---
<details>
<summary><strong>Regression Areas</strong></summary>
- [ ] Existing posts retain their custom SEO titles
- [ ] Default title format still applies to posts without custom titles
- [ ] Sitemap URLs remain accessible
- [ ] Social sharing previews still work
</details>
---
<details>
<summary><strong>Verification Commands</strong></summary>
# Check title tag on frontend
curl -s "https://your-site.test/test-post/" | grep -oP '<title>\K[^<]+'
# Verify meta description
curl -s "https://your-site.test/test-post/" | grep -oP 'name="description" content="\K[^"]+'
# Check for duplicate title tags
curl -s "https://your-site.test/test-post/" | grep -c '<title>'
</details>
Step-by-Step Implementation Guide
Prerequisites
- Claude Code installed
- GitHub CLI (gh) installed and authenticated
- A git repository with PRs on GitHub
Step 1: Create the Directory Structure
mkdir -p .claude/commands .claude/agents .claude/scripts .claude/docs
Step 2: Create the PR Analysis Script
Create .claude/scripts/analyze-pr.sh and make it executable:
chmod +x .claude/scripts/analyze-pr.sh
The script should:
- Accept a PR number or branch refs as input
- Validate prerequisites (git repo, gh CLI, authentication)
- Fetch PR metadata via gh pr view --json
- Cache the git diff upfront for performance
- Categorize changed files by type (frontend, backend, API, DB, etc.)
- Detect features and patterns using regex on the cached diff — look for new classes, functions, REST endpoints, database operations, form changes
- Detect domain-specific changes relevant to your application (the specific patterns will vary by project)
- Analyze testing requirements — flag which types of testing are needed based on file types changed
- Detect platform-specific patterns — routes, middleware, hooks, cron jobs, caching, migrations, etc.
- Analyze permissions — find capability/authorization checks in the diff
- Analyze data flow — find input sources (user input, request data) and output points
- Output the full diff with configurable context lines
- Calculate a complexity score to estimate testing effort
The key principle: extract everything the AI agent needs so it doesn’t have to run git commands itself. The script is the data preparation layer.
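If you want a starting point, the top-level flow can be skeletoned like this. The section names are illustrative placeholders; each emit_section call marks where a real implementation of the corresponding checklist item would go:

```shell
#!/usr/bin/env bash
# Skeleton of the analysis script's top-level flow. Section names are
# illustrative; every body here is a placeholder for real extraction logic.
set -euo pipefail

emit_section() {            # uniform section headers inside qa-analysis.txt
  printf '\n=== %s ===\n' "$1"
}

main() {
  emit_section "PR METADATA"          # gh pr view --json title,body,labels,...
  emit_section "CHANGED FILES"        # categorized in a single awk pass
  emit_section "FEATURE DETECTION"    # regex checks over the cached diff
  emit_section "TESTING REQUIREMENTS" # unit / integration / E2E / API flags
  emit_section "FULL DIFF"            # git diff with configurable context lines
  emit_section "COMPLEXITY SCORE"     # weighted sum of detected change types
}
main
```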
Step 3: Create the QA Agent Definition
Create .claude/agents/qa-agent.md (or name it for your framework):
The agent definition should include:
- Identity — Define the agent’s expertise and perspective
- Mission — What the agent produces (test cases, not code review)
- Testing approach — White-box testing methodology derived from the actual diff
- Reference document pointers — Which docs to read based on what the PR touches
- Input specification — What the agent receives (analysis file, issue context, mode)
- Analysis methodology — The multi-phase approach (context → requirements → generation → formatting → quality check)
- Test case format — Exact template with field definitions
- Merging rules — When to combine related scenarios into single test cases
- Output templates — Mode-aware templates (default vs max)
- Guidelines — Do’s and don’ts for test case quality
- Quality checklist — Final verification before output
Step 4: Create the Slash Command
Create .claude/commands/qa-pr.md:
The command should:
- Parse arguments — Extract PR number and optional mode flag
- Run the analysis script — Execute the shell script and capture output
- Fetch linked issues — Parse PR body for Fixes #123 patterns, fetch issue details
- Invoke the agent — Pass all context to the QA agent
- Handle PR posting — Optionally insert the test plan into the PR description (with user confirmation)
Step 5: Create Your QA Reference Playbook
Create .claude/docs/your-qa-playbook.md:
This is the most project-specific piece. Include:
- Security testing payloads relevant to your stack
- Verification commands for your application’s features
- Test case templates for your common change types
- Priority classification matching your team’s severity levels
- Environment matrix (supported versions, configurations)
- Common gotchas specific to your framework/platform
- Integration testing matrix (third-party dependencies)
Step 6: Create Additional Reference Documents
For each major feature area of your application, create a focused reference doc that the agent conditionally reads:
- rest-api.md — All API endpoints, request/response formats, auth requirements
- database-schema.md — Table structures, relationships, migration patterns
- settings-map.md — All configuration options and their effects
- feature-interactions.md — How features depend on each other (for regression analysis)
Step 7: Use It
# Open Claude Code in your project
claude
# Generate a test plan for PR #1234
/qa-pr 1234
# Full analysis mode
/qa-pr 1234 max
Key Design Decisions and Why
1. Shell Script for Data Extraction, AI for Analysis
The shell script handles deterministic operations (git diff, file categorization, pattern matching) while the AI handles judgment-based tasks (prioritization, edge case identification, test case writing). This separation means the AI agent receives clean, structured input and can focus on what it’s best at.
2. Cached Git Operations
The script caches git diff output at startup and runs all subsequent pattern matching on the cached strings. Without this, the 30+ detection checks would each spawn a new git diff process.
3. Sonnet for the Agent, Default Model for Orchestration
The QA agent generates long, structured output where Sonnet excels. The orchestrator (slash command) handles judgment calls like parsing arguments and deciding whether to post to PR — tasks where the default model’s reasoning is valuable.
4. Conditional Reference Document Loading
Instead of stuffing all domain knowledge into the agent’s system prompt, the agent reads reference docs based on what the PR actually touches. A meta-only PR doesn’t need to load the database schema reference.
5. White-Box Over Black-Box
The agent receives the full code diff, not just a feature description. This enables test cases that trace back to specific code branches, boundary conditions, and error paths — not just “test that the feature works.”
6. Linked Issue Integration Without Dumping
Issue context enriches test cases but doesn’t appear as a separate section. Bug reports become better P0 test cases with specific reproduction steps, not copy-pasted issue descriptions.
7. Test Case Merging Rules
Related scenarios (empty, null, special chars, XSS for the same field) merge into one test case. This matches how QA testers actually work — you test one area thoroughly, then move on.
8. Idempotent PR Posting
The command can be run multiple times on the same PR. It strips any existing test plan before inserting the new one, so you always get a clean update.
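A robust alternative to the csplit/sed approach shown earlier is explicit marker comments. This is a different technique from what the command itself uses, sketched here under that assumption, with hypothetical marker names:

```shell
# Idempotent update via marker comments (hypothetical qa-test-plan markers).
# BODY stands in for the existing PR description; NEW_PLAN for the fresh plan.
BODY='Intro text
<!-- qa-test-plan:start -->
old plan
<!-- qa-test-plan:end -->
### Beta Builds'

NEW_PLAN='new plan v2'

# Drop any existing marker block, then re-insert the plan before the target
# section. Running this twice produces the same body as running it once.
UPDATED=$(echo "$BODY" | awk -v plan="$NEW_PLAN" '
  /<!-- qa-test-plan:start -->/ { skip=1; next }
  /<!-- qa-test-plan:end -->/   { skip=0; next }
  skip { next }
  /^### Beta Builds/ {
    print "<!-- qa-test-plan:start -->"
    print plan
    print "<!-- qa-test-plan:end -->"
  }
  { print }
')

echo "$UPDATED"
```

Markers make the strip step trivial and avoid depending on the plan's own content (like the “QA Test Plan” heading) to find its boundaries.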
Customizing for Your Project
Adapting for Other Frameworks
The architecture is framework-agnostic. To adapt it:
- Shell script: Change the pattern detection section to match your framework’s patterns (Express routes, Django views, React components, etc.)
- Agent definition: Update the testing knowledge to match your stack’s testing patterns (middleware testing, API gateway checks, state management testing, etc.)
- Reference playbook: Update security payloads for your stack’s common vulnerabilities and CLI tools.
- Reference docs: Create docs for your project’s architecture — API endpoints, database schema, settings, feature interactions.
For Different Team Sizes
- Solo developer: Use default mode only. Focus the playbook on your most critical features.
- Small team: Add the PR posting feature. Use max mode for high-risk PRs.
- Large team: Create multiple reference docs per domain area. Consider adding automated Playwright test generation as a follow-up step.
For Different PR Workflows
- If you don’t use a “Beta Builds” section: Change the insertion point in the slash command to wherever you want the test plan placed in the PR body.
- If you want the test plan as a PR comment instead: Replace gh pr edit --body-file with gh pr comment --body-file.
- If you want it in a separate file: Skip the PR posting step and just save the output.
Conclusion
Building this system took iteration, but the architecture is straightforward:
- A shell script that extracts structured data from PRs
- A custom agent that understands QA testing and your domain
- A slash command that orchestrates the pipeline
- Reference documents that give the agent domain expertise
The result is a system where typing /qa-pr 1234 produces a comprehensive, prioritized, white-box test plan in under a minute — something that would take a QA engineer hours to produce manually.
The key insight is that AI is best at synthesis and judgment, not data extraction. Let shell scripts handle the deterministic parts (git diff, pattern matching), and let the AI focus on what it’s uniquely good at: reading code, understanding intent, identifying edge cases, and structuring test plans that a human can actually execute.
Every project has different features, patterns, and risk areas. The value isn’t in copying these files verbatim — it’s in understanding the architecture and adapting it to your codebase. Start with the shell script, add the agent, build your playbook, and iterate from there.