How I Built an AI-Powered QA Agent That Generates Test Plans From Pull Requests Using Claude Code
Manual QA is one of the biggest bottlenecks in software development. Every pull request needs test cases, every feature needs regression coverage, and every bug fix needs verification steps. I built an automated system that reads your PR diff, understands what changed at the code level, and generates a comprehensive, white-box test plan — all triggered by a single slash command inside Claude Code.
In this guide, I’ll walk you through the exact architecture, every file involved, and how you can replicate this for your own project.
Table of Contents
- The Problem: Manual QA Doesn’t Scale
- Architecture Overview
- Component 1: The PR Analysis Shell Script
- Component 2: The Slash Command (Orchestrator)
- Component 3: The QA Agent (Test Plan Generator)
- Component 4: The QA Reference Playbook
- How It All Works Together
- The Output: What You Get
- Step-by-Step Implementation Guide
- Key Design Decisions and Why
- Customizing for Your Project
- Conclusion
The Problem: Manual QA Doesn’t Scale
If you’re working on a project with frequent PRs, you’ve probably experienced this:
- Developers write PRs with vague test instructions like “test that it works”
- QA engineers spend hours reading diffs to understand what actually changed
- Edge cases get missed because nobody traced every if/else branch in the code
- Security vulnerabilities (missing nonce checks, unsanitized input) slip through
- Regression areas are overlooked because the impact wasn’t mapped
I wanted to solve this by building a system where I type one command, and an AI agent reads the actual code diff, performs white-box analysis, and generates a structured, prioritized test plan that a QA tester can execute immediately.
Architecture Overview
The system has four components that work in a pipeline:
┌─────────────────────────────────────────────────────────────────┐
│ /qa-pr 1234 │
│ (Slash Command) │
│ │
│ 1. Runs analyze-pr.sh ──► Generates qa-analysis.txt │
│ 2. Fetches linked GitHub issues for context │
│ 3. Invokes qa-agent agent with the analysis │
│ 4. Agent reads QA reference docs + diff │
│ 5. Generates structured test plan │
│ 6. Optionally posts test plan to PR description │
└─────────────────────────────────────────────────────────────────┘
File structure:
.claude/
├── commands/
│ └── qa-pr.md # Slash command (orchestrator)
├── agents/
│ └── qa-agent.md # QA agent definition
├── scripts/
│ └── analyze-pr.sh # PR diff analysis script
└── docs/
├── qa-playbook.md # QA reference playbook
├── rest-api.md # REST API endpoints & contracts
├── database-schema.md # Database tables & relationships
├── architecture.md # System architecture overview
└── test-environments.md # Environment configs & setup
Let’s break down each component.
Component 1: The PR Analysis Shell Script
File: .claude/scripts/analyze-pr.sh
This is the foundation — a bash script that extracts everything the AI agent needs from a PR. It uses the GitHub CLI (gh) and git diff to produce a structured analysis file.
What It Does
The script accepts a PR number (or two branch refs) and generates a comprehensive analysis covering:
- PR metadata — title, author, description, labels, review instructions
- Commit history — categorized by type (bug fix, feature, refactor, test, docs)
- Changed files summary — categorized into Frontend, Backend, API, Database, Test, Docs, Config
- Code statistics — lines added/removed per file type
- Feature detection — new classes, functions, REST endpoints, database operations, form changes
- Domain-specific change detection — SEO meta, schema, sitemaps, social meta, redirects, settings, etc.
- Testing requirements analysis — flags which test types are needed (unit, integration, E2E, API, visual, performance, accessibility)
- Framework-specific patterns — routes, middleware, hooks, cron jobs, caching
- User roles & permissions — capability checks that need role-based testing
- Data flow analysis — input sources (superglobals) and output points that need escaping
- Full file-by-file diff — the actual code changes with configurable context lines
- Complexity score — a calculated score that estimates testing effort
Key Implementation Details
Cached git operations for performance:
# Cache all git diff operations at the start to avoid repeated git calls
CHANGED_FILES=$(git diff --name-only "$BASE_REF...$HEAD_REF")
PHP_FILES_LIST=$(echo "$CHANGED_FILES" | grep '\.php$' || true)
JS_FILES_LIST=$(echo "$CHANGED_FILES" | grep -E '\.(js|vue|jsx|tsx|ts)$' || true)
if [ -n "$PHP_FILES_LIST" ]; then
PHP_DIFF=$(echo "$PHP_FILES_LIST" | xargs -I {} git diff "$BASE_REF...$HEAD_REF" -- {} 2>/dev/null || true)
fi
The script caches the full diff at startup so all subsequent pattern matching operates on in-memory strings rather than re-running git diff for each check. This is important because the script runs 30+ pattern detections.
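As a simplified sketch of that idea (with a stand-in diff string and illustrative patterns, not the script's real regex set), each detection becomes a grep over the cached string:

```shell
# Sketch of pattern detection against a cached diff. DIFF stands in for the
# PHP_DIFF variable the script caches at startup; patterns are illustrative.
DIFF='+function save_meta( $post_id ) {
+register_rest_route( "myplugin/v1", "/meta" );
+$wpdb->query( "DELETE FROM {$wpdb->prefix}meta" );'

# Every check runs against the in-memory string, never a fresh git diff.
NEW_FUNCTIONS_COUNT=$(echo "$DIFF" | grep -c '^+function ' || true)
REST_ENDPOINTS_COUNT=$(echo "$DIFF" | grep -c 'register_rest_route' || true)
DB_OPERATIONS_COUNT=$(echo "$DIFF" | grep -cE '\$wpdb->(query|insert|update|delete)' || true)

echo "functions=$NEW_FUNCTIONS_COUNT endpoints=$REST_ENDPOINTS_COUNT db=$DB_OPERATIONS_COUNT"
```

With 30+ of these checks, the one-time cache turns 30+ subprocess spawns into 30+ string scans.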
Single-pass file categorization with awk:
Instead of running multiple grep commands over the file list, the script uses a single awk pass to categorize all files:
echo "$CHANGED_FILES" | awk '
BEGIN { frontend_count=0; backend_count=0; api_count=0; ... }
{
file = $0
if (file ~ /\.vue$/) { frontend[++frontend_count] = file }
else if (file ~ /\.php$/) {
backend[++backend_count] = file
if (file ~ /(api|rest|endpoint|ajax)/) api[++api_count] = file
}
...
}
END { ... }'
Smart commit categorization:
git log --oneline --no-merges "$BASE_REF..$HEAD_REF" | while read -r commit_line; do
  commit_msg="${commit_line#* }"  # strip the short hash, keep the message
  if echo "$commit_msg" | grep -iE '(fix|bug|issue|resolve|close)' > /dev/null; then
    echo "BUG FIX: $commit_line"
  elif echo "$commit_msg" | grep -iE '(feat|feature|add|new)' > /dev/null; then
    echo "NEW FEATURE: $commit_line"
  fi
done
Complexity scoring:
COMPLEXITY_SCORE=0
COMPLEXITY_SCORE=$((COMPLEXITY_SCORE + NEW_FUNCTIONS_COUNT))
COMPLEXITY_SCORE=$((COMPLEXITY_SCORE + NEW_CLASSES_COUNT * 2))
COMPLEXITY_SCORE=$((COMPLEXITY_SCORE + REST_ENDPOINTS_COUNT * 3))
COMPLEXITY_SCORE=$((COMPLEXITY_SCORE + DB_OPERATIONS_COUNT * 2))
Each type of change is weighted differently. REST endpoints get 3x because they need auth, validation, and response testing. Database operations get 2x because they need data integrity testing.
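The score is most useful once it maps to a label a tester can act on. Here is a hedged sketch of that mapping; the thresholds (10 and 25) and the sample counts are illustrative, not taken from the original script:

```shell
# Illustrative inputs: what the detection phase might have counted for one PR.
NEW_FUNCTIONS_COUNT=4; NEW_CLASSES_COUNT=1; REST_ENDPOINTS_COUNT=2; DB_OPERATIONS_COUNT=3

# Weighted sum, matching the weights described above.
COMPLEXITY_SCORE=0
COMPLEXITY_SCORE=$((COMPLEXITY_SCORE + NEW_FUNCTIONS_COUNT))
COMPLEXITY_SCORE=$((COMPLEXITY_SCORE + NEW_CLASSES_COUNT * 2))
COMPLEXITY_SCORE=$((COMPLEXITY_SCORE + REST_ENDPOINTS_COUNT * 3))
COMPLEXITY_SCORE=$((COMPLEXITY_SCORE + DB_OPERATIONS_COUNT * 2))

# Hypothetical thresholds for turning the raw score into a risk label.
if   [ "$COMPLEXITY_SCORE" -ge 25 ]; then RISK="High"
elif [ "$COMPLEXITY_SCORE" -ge 10 ]; then RISK="Medium"
else RISK="Low"; fi

echo "Complexity: $COMPLEXITY_SCORE ($RISK)"
```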
Usage
# Analyze a PR by number
./analyze-pr.sh 1234 --output-file=qa-analysis.txt
# Analyze local branches
./analyze-pr.sh main feature-branch
# Compact mode for small PRs
./analyze-pr.sh 1234 --compact
Component 2: The Slash Command (Orchestrator)
File: .claude/commands/qa-pr.md
This is the entry point — a Claude Code custom slash command that orchestrates the entire pipeline. When you type /qa-pr 1234 in Claude Code, this file defines what happens.
The Orchestration Flow
---
description: Run QA analysis on a PR and generate a comprehensive test plan
argument-hint: <pr-number> [max]
allowed-tools: Bash, Read, Write, Task
---
Step 1: Parse arguments and run the analysis script
The command accepts a PR number and an optional max flag. It runs the shell script to generate the analysis file:
cd "$(git rev-parse --show-toplevel)"
.claude/scripts/analyze-pr.sh PR_NUM --output-file=qa-analysis.txt
Step 2: Read the generated analysis
Claude Code reads the qa-analysis.txt file to understand the full context of the PR.
Step 3: Fetch linked GitHub issues for deeper context
gh pr view PR_NUM --json body -q '.body' | grep -oE '(Fixes|Closes|Resolves|Related to) #[0-9]+' | grep -oE '[0-9]+'
For each linked issue, it fetches the full issue context:
gh issue view {ISSUE_NUMBER} --json title,body,comments
This is a critical design decision: linked issues are used to enrich test cases, not dumped into the output. Bug reports from issues get synthesized into better test cases with specific reproduction steps.
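The extract-then-fetch loop can be sketched as follows. PR_BODY stands in for the gh pr view output, and the gh issue view call is echoed rather than executed so the sketch stays self-contained:

```shell
# PR_BODY stands in for: gh pr view PR_NUM --json body -q '.body'
PR_BODY='This PR improves meta handling.

Fixes #456
Related to #789'

# Pull out linked issue numbers using the standard GitHub linking keywords.
ISSUE_NUMS=$(echo "$PR_BODY" \
  | grep -oE '(Fixes|Closes|Resolves|Related to) #[0-9]+' \
  | grep -oE '[0-9]+')

for n in $ISSUE_NUMS; do
  # In the real command this runs: gh issue view "$n" --json title,body,comments
  echo "would fetch issue #$n"
done
```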
Step 4: Invoke the QA agent
The command passes the analysis content, linked issue context, and QA mode (default or max) to the qa-agent subagent.
Step 5: Post to PR description (with user confirmation)
After generation, the command asks if you want to add the test plan to the PR. If yes, it:
- Wraps the plan in a collapsible <details> block
- Inserts it before the “Beta Builds” section (configurable)
- Replaces any existing test plan (idempotent)
- Updates the PR via gh pr edit
The insertion logic handles edge cases:
# Split PR body at target section — test plan goes right before it
if grep -q '^### Beta Builds' /tmp/qa-pr-existing-body.md; then
csplit -f /tmp/qa-split- -s /tmp/qa-pr-existing-body.md '/^### Beta Builds$/'
fi
# If an existing QA Test Plan block is present, strip it
if grep -q 'QA Test Plan' "$BEFORE"; then
sed '/<details>/{N;/QA Test Plan/,/<\/details>/d}' "$BEFORE" > /tmp/qa-before-clean.md
fi
Two Modes
- /qa-pr 1234 — Default mode: Test Data, Test Cases, Regression Areas, Verification Commands
- /qa-pr 1234 max — Full mode: adds Quick Test Script (curl commands for API PRs) and Additional Test Considerations (boundary cases, integration tests, security tests)
Component 3: The QA Agent (Test Plan Generator)
File: .claude/agents/qa-agent.md
This is the brain of the system — a Claude Code custom agent that acts as an elite QA engineer. It receives the analysis output and generates structured, prioritized test cases.
Agent Definition
---
name: qa-agent
description: QA Engineer for comprehensive test case generation
tools: Read, Write, Bash
model: sonnet
color: green
---
Key choices here:
- model: sonnet — Uses Sonnet for cost efficiency. The agent generates long, structured output, so using a faster/cheaper model makes sense while the orchestrator (which needs judgment) runs on the default model.
- tools: Read, Write, Bash — The agent can read reference docs, write the test plan, and run bash commands if needed.
White-Box Testing Approach
This is what makes the system fundamentally different from generic QA. The agent doesn’t just test the feature described in the PR — it reads the actual code diff and derives test cases from the implementation:
Branch coverage: For every if/else, switch/case, and ternary in the diff, it creates test cases that exercise each branch.
Boundary conditions: If code checks if ( $count > 10 ), it generates tests with values 9, 10, and 11.
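The boundary rule is mechanical enough to sketch as a tiny helper. This is hypothetical, purely to illustrate the pattern the agent applies; the agent itself does this in reasoning, not in shell:

```shell
# Hypothetical helper: given the limit from a check like `if ( $count > 10 )`,
# emit the three values a tester should exercise (below, at, above the boundary).
boundary_values() {
  local limit="$1"
  echo "$((limit - 1)) $limit $((limit + 1))"
}

boundary_values 10   # prints: 9 10 11
```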
Error paths: It finds try/catch blocks, wp_die(), return new WP_Error(), and early return guards, then creates test cases that trigger each error path.
Security analysis:
- Missing current_user_can() checks → flagged as P0 security issue
- Missing wp_verify_nonce() → flagged with test case
- Raw SQL without $wpdb->prepare() → SQL injection test cases
- User input without sanitization → XSS test cases
Data type assumptions: If code uses intval() but input could be an array, or uses sanitize_text_field() on HTML content, it creates type mismatch test cases.
Null/empty handling: If code accesses $data['key'] without existence checks, it generates tests with missing/empty data.
The 5-Phase Analysis Methodology
Phase 1: Context Understanding
- What changed? (feature, bug fix, refactor)
- Why? (problem statement from PR description + linked issues)
- What’s the expected outcome?
- What did the author think was important to test?
Phase 2: Test Requirement Identification
For each relevant area, it identifies what needs testing across 8 categories:
- Functional (happy path, variations, error messages)
- Integration (core platform, third-party conflicts, REST API)
- UI/UX (responsive, browser compat, visual states)
- Data (CRUD, validation, edge cases)
- API (auth, validation, response format)
- Security (XSS, SQLi, CSRF, capability checks)
- Performance (load times, query efficiency, large data)
- Accessibility (keyboard nav, screen reader, contrast)
Phase 3: Test Case Generation
Each test case follows a strict format:
TEST CASE ID: TC-[MODULE]-[NUMBER]
MODULE: [Feature Name]
TYPE: [Functional/Integration/UI/Security/Performance/Accessibility]
PRIORITY: [P0-Critical/P1-High/P2-Medium/P3-Low]
OBJECTIVE:
[One sentence describing what this test verifies]
PRECONDITIONS:
- [Required setup, user role, data state]
TEST STEPS:
1. [Specific action]
2. [Specific action]
EXPECTED RESULT:
- [Specific, verifiable outcome]
Smart merging rules: Related scenarios targeting the same function or UI element get merged into a single test case. Instead of 4 separate cases for “empty title”, “null title”, “special chars in title”, “XSS payload in title” — you get one “Title field input validation” case with multiple verification steps.
Phase 4: Output Generation
The agent uses mode-aware templates:
- Default mode: 4 sections (Test Data, Test Cases, Regression Areas, Verification Commands)
- Max mode: 7 sections (adds Quick Test Script, Additional Test Considerations)
Phase 5: Quality Checklist
Before finalizing, the agent verifies:
- All changed features covered
- Error conditions and boundary cases included
- Integration points covered
- Access control tests present
- Steps are specific and executable
- Expected results are verifiable
Reference Document System
The agent reads domain-specific reference docs based on what the PR touches:
| PR Touches | Agent Reads |
|---|---|
| Security, payloads, general QA | qa-playbook.md |
| REST API endpoints | rest-api.md |
| Database, migrations | database-schema.md |
| Meta titles, tag variables | smart-tags.md |
| Settings pages | settings-map.md |
| Cross-feature impact | feature-interactions.md |
This gives the agent domain knowledge without stuffing it all into the system prompt. The baseline playbook (qa-playbook.md) is always read; additional docs are loaded conditionally.
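A minimal sketch of the selection rule, assuming the doc names from the table above and simple path-based heuristics (the real decision is made by the agent from the analysis file, not by a script):

```shell
# CHANGED_FILES stands in for the file list from qa-analysis.txt.
CHANGED_FILES='includes/rest/class-meta-endpoint.php
includes/db/migrations/004-add-index.php'

DOCS="qa-playbook.md"   # the baseline playbook is always read

# Conditional docs, loaded only when the PR touches the matching area.
echo "$CHANGED_FILES" | grep -qE '(api|rest|endpoint)' && DOCS="$DOCS rest-api.md"
echo "$CHANGED_FILES" | grep -qE '(db|database|migration|schema)' && DOCS="$DOCS database-schema.md"

echo "Agent reads: $DOCS"
```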
Component 4: The QA Reference Playbook
File: .claude/docs/qa-playbook.md (your project’s equivalent)
This is the agent’s domain knowledge — a curated reference that teaches it how to test your specific application. It includes:
Security Testing Payloads
Pre-built payloads organized by attack type:
XSS payloads — basic, event handlers, encoded, and platform-specific (shortcode context, block editor context):
<script>alert('XSS')</script>
" onmouseover="alert('XSS')
%3Cscript%3Ealert(1)%3C/script%3E
[shortcode attr="<script>alert(1)</script>"]
SQL injection payloads — basic, union-based, time-based:
' OR '1'='1
' UNION SELECT user_login,user_pass,3 FROM wp_users--
' AND SLEEP(5)--
CSRF testing commands:
# Test without nonce
curl -X POST "https://site.com/wp-admin/admin-ajax.php" \
-d "action=your_action&data=test" \
-H "Cookie: session_token=..."
Verification Commands
Ready-to-use commands for verifying different aspects of your application:
# Check meta tags
curl -s "https://site.com/page/" | grep -oP 'name="description" content="\K[^"]+'
# Extract JSON-LD schema
curl -s "https://site.com/page/" | \
grep -oP '<script type="application/ld\+json">\K[^<]+' | \
python3 -m json.tool
# Test redirects
curl -sIL "https://site.com/old-url/" 2>/dev/null | grep -E "HTTP|Location"
Test Case Templates
Pre-built test case templates for each feature area (SEO meta, schema, sitemaps, social meta, redirects, etc.) that the agent uses as starting points and customizes based on the actual diff.
Priority Classification
| Priority | When to Use | Response |
|---|---|---|
| P0 | Security issue, data loss, complete failure | Block release |
| P1 | Core feature broken, no workaround | Fix this sprint |
| P2 | Feature works with issues, workaround exists | Fix next sprint |
| P3 | Cosmetic, minor inconvenience | Backlog |
Environment Matrix
Testing environment requirements including runtime versions, database versions, and common test scenarios (fresh install vs upgrade, single instance vs multi-tenant, etc.).
How It All Works Together
Here’s the complete flow when you run /qa-pr 1234:
User types: /qa-pr 1234
│
▼
┌──────────────────────────────┐
│ qa-pr.md (Slash Command) │
│ Parses args: PR=1234 │
│ QA_MODE=default │
└──────────┬───────────────────┘
│
▼
┌──────────────────────────────┐
│ analyze-pr.sh 1234 │
│ --output-file=qa-analysis │
│ │
│ ► gh pr view (metadata) │
│ ► git diff (code changes) │
│ ► Pattern detection (30+ │
│ checks for features, │
│ security, DB ops, etc.) │
│ ► Complexity scoring │
│ │
│ Output: qa-analysis.txt │
└──────────┬───────────────────┘
│
▼
┌──────────────────────────────┐
│ Fetch linked issues │
│ gh issue view #456 │
│ (bug reports, context) │
└──────────┬───────────────────┘
│
▼
┌──────────────────────────────┐
│ qa-agent Agent │
│ (Claude Sonnet) │
│ │
│ Inputs: │
│ ► qa-analysis.txt │
│ ► Linked issue context │
│ ► QA_MODE=default │
│ │
│ Reads reference docs: │
│ ► qa-playbook.md (always) │
│ ► rest-api.md (if API PR) │
│ ► database-schema.md (if │
│ DB changes) │
│ │
│ 5-phase analysis: │
│ 1. Context understanding │
│ 2. Test requirements │
│ 3. Test case generation │
│ 4. Output formatting │
│ 5. Quality checklist │
│ │
│ Output: Structured test │
│ plan with P0-P3 cases │
└──────────┬───────────────────┘
│
▼
┌──────────────────────────────┐
│ Ask user: Post to PR? │
│ │
│ If yes: │
│ ► Wrap in <details> block │
│ ► Insert before target │
│ section in PR body │
│ ► gh pr edit --body-file │
└──────────────────────────────┘
The Output: What You Get
Default Mode Output Structure
# QA Test Plan
> **Change Type:** Bug Fix | **Risk Level:** High | **Files Changed:** 5 | **Test Cases:** 12 | **Required Plugins:** WooCommerce
---
<details>
<summary><strong>Test Data</strong></summary>
### Input Values to Test
| Value | Purpose |
|-------|---------|
| Empty string "" | Tests null/empty handling in save function |
| String > 160 chars | Tests character limit boundary |
| `<script>alert(1)</script>` | Tests XSS sanitization |
</details>
---
<details>
<summary><strong>Test Cases</strong></summary>
### P0 - Critical (Must Test)
<details>
<summary><strong>TC-META-001: SEO Title saves and displays on frontend</strong></summary>
TEST CASE ID: TC-META-001
MODULE: SEO Meta
TYPE: Functional
PRIORITY: P0-Critical
OBJECTIVE:
Verify that custom SEO title saves in the editor and renders correctly in the page source.
PRECONDITIONS:
- Admin panel access
- Plugin activated
- At least one published post
TEST STEPS:
1. Navigate to Posts > Edit any published post
2. Open the SEO Settings panel
3. Enter "Custom Test Title | My Site" in the SEO Title field
4. Click Update
5. View the post on the frontend
6. View page source (Ctrl+U)
EXPECTED RESULT:
- The <title> tag contains "Custom Test Title | My Site"
- The og:title meta tag matches
- No raw variable placeholders visible
</details>
### P1 - High Priority
[More test cases...]
### P2 - Medium Priority
[More test cases...]
</details>
---
<details>
<summary><strong>Regression Areas</strong></summary>
- [ ] Existing posts retain their custom SEO titles
- [ ] Default title format still applies to posts without custom titles
- [ ] Sitemap URLs remain accessible
- [ ] Social sharing previews still work
</details>
---
<details>
<summary><strong>Verification Commands</strong></summary>
# Check title tag on frontend
curl -s "https://your-site.test/test-post/" | grep -oP '<title>\K[^<]+'
# Verify meta description
curl -s "https://your-site.test/test-post/" | grep -oP 'name="description" content="\K[^"]+'
# Check for duplicate title tags
curl -s "https://your-site.test/test-post/" | grep -c '<title>'
</details>
Step-by-Step Implementation Guide
Prerequisites
- Claude Code installed
- GitHub CLI (gh) installed and authenticated
- A git repository with PRs on GitHub
Step 1: Create the Directory Structure
mkdir -p .claude/commands .claude/agents .claude/scripts .claude/docs
Step 2: Create the PR Analysis Script
Create .claude/scripts/analyze-pr.sh and make it executable:
chmod +x .claude/scripts/analyze-pr.sh
The script should:
- Accept a PR number or branch refs as input
- Validate prerequisites (git repo, gh CLI, authentication)
- Fetch PR metadata via gh pr view --json
- Cache the git diff upfront for performance
- Categorize changed files by type (frontend, backend, API, DB, etc.)
- Detect features and patterns using regex on the cached diff — look for new classes, functions, REST endpoints, database operations, form changes
- Detect domain-specific changes relevant to your application (the specific patterns will vary by project)
- Analyze testing requirements — flag which types of testing are needed based on file types changed
- Detect platform-specific patterns — routes, middleware, hooks, cron jobs, caching, migrations, etc.
- Analyze permissions — find capability/authorization checks in the diff
- Analyze data flow — find input sources (user input, request data) and output points
- Output the full diff with configurable context lines
- Calculate a complexity score to estimate testing effort
The key principle: extract everything the AI agent needs so it doesn’t have to run git commands itself. The script is the data preparation layer.
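If you want a starting point, the top-level flow can be skeletoned like this. The section names are illustrative placeholders; each emit_section call marks where a real implementation of the corresponding checklist item would go:

```shell
#!/usr/bin/env bash
# Skeleton of the analysis script's top-level flow. Section names are
# illustrative; every body here is a placeholder for real extraction logic.
set -euo pipefail

emit_section() {            # uniform section headers inside qa-analysis.txt
  printf '\n=== %s ===\n' "$1"
}

main() {
  emit_section "PR METADATA"          # gh pr view --json title,body,labels,...
  emit_section "CHANGED FILES"        # categorized in a single awk pass
  emit_section "FEATURE DETECTION"    # regex checks over the cached diff
  emit_section "TESTING REQUIREMENTS" # unit / integration / E2E / API flags
  emit_section "FULL DIFF"            # git diff with configurable context lines
  emit_section "COMPLEXITY SCORE"     # weighted sum of detected change types
}
main
```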
Step 3: Create the QA Agent Definition
Create .claude/agents/qa-agent.md (or name it for your framework):
The agent definition should include:
- Identity — Define the agent’s expertise and perspective
- Mission — What the agent produces (test cases, not code review)
- Testing approach — White-box testing methodology derived from the actual diff
- Reference document pointers — Which docs to read based on what the PR touches
- Input specification — What the agent receives (analysis file, issue context, mode)
- Analysis methodology — The multi-phase approach (context → requirements → generation → formatting → quality check)
- Test case format — Exact template with field definitions
- Merging rules — When to combine related scenarios into single test cases
- Output templates — Mode-aware templates (default vs max)
- Guidelines — Do’s and don’ts for test case quality
- Quality checklist — Final verification before output
Step 4: Create the Slash Command
Create .claude/commands/qa-pr.md:
The command should:
- Parse arguments — Extract PR number and optional mode flag
- Run the analysis script — Execute the shell script and capture output
- Fetch linked issues — Parse PR body for Fixes #123 patterns, fetch issue details
- Invoke the agent — Pass all context to the QA agent
- Handle PR posting — Optionally insert the test plan into the PR description (with user confirmation)
Step 5: Create Your QA Reference Playbook
Create .claude/docs/your-qa-playbook.md:
This is the most project-specific piece. Include:
- Security testing payloads relevant to your stack
- Verification commands for your application’s features
- Test case templates for your common change types
- Priority classification matching your team’s severity levels
- Environment matrix (supported versions, configurations)
- Common gotchas specific to your framework/platform
- Integration testing matrix (third-party dependencies)
Step 6: Create Additional Reference Documents
For each major feature area of your application, create a focused reference doc that the agent conditionally reads:
- rest-api.md — All API endpoints, request/response formats, auth requirements
- database-schema.md — Table structures, relationships, migration patterns
- settings-map.md — All configuration options and their effects
- feature-interactions.md — How features depend on each other (for regression analysis)
Step 7: Use It
# Open Claude Code in your project
claude
# Generate a test plan for PR #1234
/qa-pr 1234
# Full analysis mode
/qa-pr 1234 max
Key Design Decisions and Why
1. Shell Script for Data Extraction, AI for Analysis
The shell script handles deterministic operations (git diff, file categorization, pattern matching) while the AI handles judgment-based tasks (prioritization, edge case identification, test case writing). This separation means the AI agent receives clean, structured input and can focus on what it’s best at.
2. Cached Git Operations
The script caches git diff output at startup and runs all subsequent pattern matching on the cached strings. Without this, the 30+ detection checks would each spawn a new git diff process.
3. Sonnet for the Agent, Default Model for Orchestration
The QA agent generates long, structured output where Sonnet excels. The orchestrator (slash command) handles judgment calls like parsing arguments and deciding whether to post to PR — tasks where the default model’s reasoning is valuable.
4. Conditional Reference Document Loading
Instead of stuffing all domain knowledge into the agent’s system prompt, the agent reads reference docs based on what the PR actually touches. A meta-only PR doesn’t need to load the database schema reference.
5. White-Box Over Black-Box
The agent receives the full code diff, not just a feature description. This enables test cases that trace back to specific code branches, boundary conditions, and error paths — not just “test that the feature works.”
6. Linked Issue Integration Without Dumping
Issue context enriches test cases but doesn’t appear as a separate section. Bug reports become better P0 test cases with specific reproduction steps, not copy-pasted issue descriptions.
7. Test Case Merging Rules
Related scenarios (empty, null, special chars, XSS for the same field) merge into one test case. This matches how QA testers actually work — you test one area thoroughly, then move on.
8. Idempotent PR Posting
The command can be run multiple times on the same PR. It strips any existing test plan before inserting the new one, so you always get a clean update.
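A robust alternative to the csplit/sed approach shown earlier is explicit marker comments. This is a different technique from what the command itself uses, sketched here under that assumption, with hypothetical marker names:

```shell
# Idempotent update via marker comments (hypothetical qa-test-plan markers).
# BODY stands in for the existing PR description; NEW_PLAN for the fresh plan.
BODY='Intro text
<!-- qa-test-plan:start -->
old plan
<!-- qa-test-plan:end -->
### Beta Builds'

NEW_PLAN='new plan v2'

# Drop any existing marker block, then re-insert the plan before the target
# section. Running this twice produces the same body as running it once.
UPDATED=$(echo "$BODY" | awk -v plan="$NEW_PLAN" '
  /<!-- qa-test-plan:start -->/ { skip=1; next }
  /<!-- qa-test-plan:end -->/   { skip=0; next }
  skip { next }
  /^### Beta Builds/ {
    print "<!-- qa-test-plan:start -->"
    print plan
    print "<!-- qa-test-plan:end -->"
  }
  { print }
')

echo "$UPDATED"
```

Markers make the strip step trivial and avoid depending on the plan's own content (like the “QA Test Plan” heading) to find its boundaries.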
Customizing for Your Project
Adapting for Other Frameworks
The architecture is framework-agnostic. To adapt it:
- Shell script: Change the pattern detection section to match your framework’s patterns (Express routes, Django views, React components, etc.)
- Agent definition: Update the testing knowledge to match your stack’s testing patterns (middleware testing, API gateway checks, state management testing, etc.)
- Reference playbook: Update security payloads for your stack’s common vulnerabilities and CLI tools.
- Reference docs: Create docs for your project’s architecture — API endpoints, database schema, settings, feature interactions.
For Different Team Sizes
- Solo developer: Use default mode only. Focus the playbook on your most critical features.
- Small team: Add the PR posting feature. Use max mode for high-risk PRs.
- Large team: Create multiple reference docs per domain area. Consider adding automated Playwright test generation as a follow-up step.
For Different PR Workflows
- If you don’t use a “Beta Builds” section: Change the insertion point in the slash command to wherever you want the test plan placed in the PR body.
- If you want the test plan as a PR comment instead: Replace gh pr edit --body-file with gh pr comment --body-file.
- If you want it in a separate file: Skip the PR posting step and just save the output.
Conclusion
Building this system took iteration, but the architecture is straightforward:
- A shell script that extracts structured data from PRs
- A custom agent that understands QA testing and your domain
- A slash command that orchestrates the pipeline
- Reference documents that give the agent domain expertise
The result is a system where typing /qa-pr 1234 produces a comprehensive, prioritized, white-box test plan in under a minute — something that would take a QA engineer hours to produce manually.
The key insight is that AI is best at synthesis and judgment, not data extraction. Let shell scripts handle the deterministic parts (git diff, pattern matching), and let the AI focus on what it’s uniquely good at: reading code, understanding intent, identifying edge cases, and structuring test plans that a human can actually execute.
Every project has different features, patterns, and risk areas. The value isn’t in copying these files verbatim — it’s in understanding the architecture and adapting it to your codebase. Start with the shell script, add the agent, build your playbook, and iterate from there.