Usage Guide
Learn how to configure and use Kaizen Agent effectively. This guide covers YAML configuration structure, CLI commands, and advanced features.
YAML Configuration Structure
Kaizen Agent uses YAML configuration files to define test suites for your AI agents. This approach eliminates the need for traditional Python test files while providing powerful testing capabilities.
Complete Configuration Example
Here's a comprehensive example that demonstrates all available configuration options:
```yaml
name: Text Analysis Agent Test Suite
agent_type: dynamic_region
file_path: agents/text_analyzer.py
description: |
  Test suite for the TextAnalyzer agent that processes and analyzes text content.
  This agent performs sentiment analysis, extracts key information, and provides
  structured analysis results. Tests cover various input types, edge cases, and
  expected output formats to ensure reliable performance.

agent:
  module: agents.text_analyzer
  class: TextAnalyzer
  method: analyze_text

evaluation:
  evaluation_targets:
    - name: sentiment_score
      source: variable
      criteria: "The sentiment_score must be a float between -1.0 and 1.0. Negative values indicate negative sentiment, positive values indicate positive sentiment. The score should accurately reflect the emotional tone of the input text."
      description: "Evaluates the accuracy of sentiment analysis output"
      weight: 0.4
    - name: key_phrases
      source: variable
      criteria: "The key_phrases should be a list of strings containing the most important phrases from the input text"
      description: "Checks if key phrase extraction is working correctly"
      weight: 0.3
    - name: analysis_quality
      source: return
      criteria: "The response should be well-structured, professional, and contain actionable insights"
      description: "Evaluates the overall quality and usefulness of the analysis"
      weight: 0.3

max_retries: 3

files_to_fix:
  - agents/text_analyzer.py
  - agents/prompts.py

referenced_files:
  - agents/prompts.py
  - utils/text_utils.py

steps:
  - name: Positive Review Analysis
    description: "Analyze a positive customer review"
    input:
      file_path: agents/text_analyzer.py
      method: analyze_text
      input:
        - name: text_content
          type: string
          value: "This product exceeded my expectations! The quality is outstanding and the customer service was excellent. I would definitely recommend it to others."
    expected_output:
      sentiment_score: 0.8
      key_phrases: ["exceeded expectations", "outstanding quality", "excellent customer service"]

  - name: Negative Feedback Analysis
    description: "Analyze negative customer feedback"
    input:
      file_path: agents/text_analyzer.py
      method: analyze_text
      input:
        - name: text_content
          type: string
          value: "I'm very disappointed with this purchase. The product arrived damaged and the support team was unhelpful."
    expected_output:
      sentiment_score: -0.7
      key_phrases: ["disappointed", "damaged product", "unhelpful support"]

  - name: Object Input Analysis
    description: "Analyze text using a structured user review object"
    input:
      file_path: agents/text_analyzer.py
      method: analyze_review
      input:
        - name: user_review
          type: object
          class_path: agents.review_processor.UserReview
          args:
            text: "This product exceeded my expectations! The quality is outstanding."
            rating: 5
            category: "electronics"
            helpful_votes: 12
            verified_purchase: true
        - name: analysis_settings
          type: dict
          value:
            include_sentiment: true
            extract_keywords: true
            detect_emotions: false
    expected_output:
      sentiment_score: 0.9
      key_phrases: ["exceeded expectations", "outstanding quality"]
      review_quality: "high"
```
Configuration Sections Explained
Basic Information
- `name`: A descriptive name for your test suite
- `agent_type`: Type of agent testing (e.g., `dynamic_region` for code-based agents)
- `file_path`: Path to the main agent file being tested
- `description`: Detailed description of what the agent does and what the tests cover
Agent Configuration
```yaml
agent:
  module: agents.text_analyzer  # Python module path
  class: TextAnalyzer           # Class name to instantiate
  method: analyze_text          # Method to call during testing
```
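For orientation, here is a minimal Python sketch of an agent that would match this configuration. The class and method names mirror the example above, but the analysis logic is a placeholder, and the idea that `source: variable` targets are read from instance attributes is our assumption, not confirmed Kaizen behavior:

```python
# agents/text_analyzer.py -- hypothetical agent matching the config above
class TextAnalyzer:
    def __init__(self):
        # Exposed as attributes so evaluation targets with source: variable
        # (sentiment_score, key_phrases) have something to read (assumption).
        self.sentiment_score: float = 0.0
        self.key_phrases: list[str] = []

    def analyze_text(self, text_content: str) -> dict:
        """Analyze the input text and return a structured result."""
        # Placeholder logic; a real agent would run sentiment analysis here.
        self.sentiment_score = 0.8 if "excellent" in text_content.lower() else -0.5
        self.key_phrases = [p.strip() for p in text_content.split(".") if p.strip()][:3]
        return {
            "sentiment_score": self.sentiment_score,
            "key_phrases": self.key_phrases,
            "analysis_quality": "Structured summary with actionable insights.",
        }
```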
Evaluation Criteria
⚠️ CRITICAL: This section feeds directly into the LLM for automated evaluation. Write clear, specific criteria for best results.
The `evaluation` section defines how Kaizen's LLM evaluates your agent's performance. Each `evaluation_target` specifies what to check and how to score it.
```yaml
evaluation:
  evaluation_targets:
    - name: sentiment_score  # Name of the output to evaluate
      source: variable       # 'variable' (from agent output) or 'return' (from method return)
      criteria: "Description of what constitutes a good result"
      description: "Additional context about this evaluation target"
      weight: 0.4            # Relative importance (0.0 to 1.0)
```
Key Components
- `name`: Must match a field in your agent's output or return value
- `source`:
  - `variable`: Extract from the agent's output variables/attributes
  - `return`: Use the method's return value
- `criteria`: **Most important** - Instructions for the LLM evaluator
- `description`: Additional context to help the LLM understand the evaluation
- `weight`: Relative importance (0.0 to 1.0; weights across all targets should total 1.0) - see the scoring sketch after this list
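To make the weighting concrete, here is a small illustrative calculation. The per-target scores and the weighted-sum formula are assumptions for illustration only; Kaizen's internal aggregation may differ:

```python
# Illustrative only: how weights could combine per-target scores.
# The scores below are hypothetical LLM-evaluation results.
targets = {
    "sentiment_score":  {"weight": 0.4, "score": 0.9},
    "key_phrases":      {"weight": 0.3, "score": 0.7},
    "analysis_quality": {"weight": 0.3, "score": 0.8},
}

overall = sum(t["weight"] * t["score"] for t in targets.values())
print(f"Overall: {overall:.2f}")  # 0.4*0.9 + 0.3*0.7 + 0.3*0.8 = 0.81
```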
Writing Effective Criteria
✅ Good Examples:

```yaml
- name: sentiment_score
  source: variable
  criteria: "The sentiment_score must be a float between -1.0 and 1.0. Negative values indicate negative sentiment, positive values indicate positive sentiment. The score should accurately reflect the emotional tone of the input text."
  weight: 0.4

- name: response_quality
  source: return
  criteria: "The response should be professional, well-structured, and contain actionable insights. It must be free of grammatical errors and provide specific, relevant information that addresses the user's query directly."
  weight: 0.6
```

❌ Poor Examples:

```yaml
- name: result
  source: return
  criteria: "Should be good"  # Too vague
  weight: 1.0

- name: accuracy
  source: variable
  criteria: "Check if it's correct"  # Not specific enough
  weight: 1.0
```
Tips for Better LLM Evaluation
- Be Specific: Include exact requirements, ranges, or formats
- Provide Context: Explain what "good" means in your domain
- Include Examples: Reference expected patterns or behaviors
- Consider Edge Cases: Mention how to handle unusual inputs
- Use Clear Language: Avoid ambiguous terms that LLMs might misinterpret
Testing Configuration
- `max_retries`: Number of retry attempts if a test fails
- `files_to_fix`: Files that Kaizen can modify to fix issues
- `referenced_files`: Additional files for context (not modified)
Test Steps
Each step defines a test case with:
- `name`: Descriptive name for the test
- `description`: What this test is checking
- `input`:
  - `file_path`: Path to the agent file
  - `method`: Method to call
  - `input`: List of parameters, each with a name, type, and value
Input Types Supported
Kaizen supports multiple input types for test parameters:
String Input
```yaml
- name: text_content
  type: string
  value: "Your text here"
```
Dictionary Input
```yaml
- name: config
  type: dict
  value:
    key1: "value1"
    key2: "value2"
```
Object Input
```yaml
- name: user_review
  type: object
  class_path: agents.review_processor.UserReview
  args:
    text: "This product exceeded my expectations! The quality is outstanding."
    rating: 5
    category: "electronics"
    helpful_votes: 12
    verified_purchase: true
```
The `class_path` specifies the Python class to instantiate, and `args` provides the constructor arguments.
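For an object input to work, the class referenced by `class_path` must be importable, with the listed `args` accepted by its constructor. A hypothetical `UserReview` definition consistent with the example might look like this:

```python
# agents/review_processor.py -- hypothetical class for the object input above
from dataclasses import dataclass

@dataclass
class UserReview:
    text: str
    rating: int
    category: str
    helpful_votes: int
    verified_purchase: bool
```

Each key under `args` maps to a constructor parameter, so the object can be built as `UserReview(text=..., rating=5, category="electronics", ...)` before being passed to the method.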
Finally, each step's `expected_output` block specifies the expected results used for evaluation.
CLI Commands
Basic Testing Commands
```bash
# Run tests
kaizen test-all --config kaizen.yaml

# With auto-fix
kaizen test-all --config kaizen.yaml --auto-fix

# Create PR with fixes
kaizen test-all --config kaizen.yaml --auto-fix --create-pr

# Save detailed logs
kaizen test-all --config kaizen.yaml --save-logs
```
Environment Setup Commands
```bash
# Check environment setup
kaizen setup check-env

# Create environment example file
kaizen setup create-env-example
```
GitHub Integration Commands
```bash
# Test GitHub access
kaizen test-github-access --repo owner/repo-name

# Diagnose GitHub access issues
kaizen diagnose-github-access --repo owner/repo-name
```
Command Options
| Option | Description | Example |
|---|---|---|
| `--config` | Path to YAML configuration file | `--config kaizen.yaml` |
| `--auto-fix` | Automatically fix issues found during testing | `--auto-fix` |
| `--create-pr` | Create a pull request with fixes (requires GitHub setup) | `--create-pr` |
| `--save-logs` | Save detailed execution logs to the `test-logs/` directory | `--save-logs` |
| `--repo` | GitHub repository for PR creation (format: `owner/repo-name`) | `--repo myuser/myproject` |
Simple Configuration Template
For quick testing, you can use this minimal template:
```yaml
name: My Agent Test
file_path: my_agent.py
description: "Test my AI agent"

agent:
  module: my_agent
  class: MyAgent
  method: process

evaluation:
  evaluation_targets:
    - name: result
      source: return
      criteria: "The response should be accurate and helpful"
      weight: 1.0

files_to_fix:
  - my_agent.py

steps:
  - name: Basic Test
    input:
      file_path: my_agent.py
      method: process
      input:
        - name: user_input
          type: string
          value: "Hello, how are you?"
    expected_output:
      result: "I'm doing well, thank you!"
```
Best Practices
1. Write Clear Evaluation Criteria
The quality of your evaluation criteria directly impacts how well Kaizen can improve your agent. Be specific and comprehensive.
2. Test Edge Cases
Include tests for:
- Empty or minimal inputs
- Very long inputs
- Unusual or unexpected inputs
- Boundary conditions
3. Use Descriptive Test Names
Good test names help you understand what's being tested:
- ✅ `Professional Email Improvement`
- ✅ `Edge Case - Empty Email`
- ❌ `Test 1`
- ❌ `Basic Test`
4. Balance Test Coverage
Include a mix of:
- Happy path tests (normal, expected inputs)
- Edge case tests (unusual inputs)
- Error condition tests (invalid inputs)
5. Review Generated Fixes
Always review the fixes Kaizen generates before applying them to production code.
Advanced Features
Custom Evaluation Functions
You can define custom evaluation logic for complex scenarios:
```yaml
evaluation:
  evaluation_targets:
    - name: custom_score
      source: custom_function
      function: my_evaluator.evaluate_response
      criteria: "Custom evaluation criteria"
      weight: 0.5
```
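The exact signature Kaizen expects for a custom evaluation function is not documented here, so the following is only a hedged sketch: it assumes a module-level function that receives the agent's output and returns a score between 0.0 and 1.0.

```python
# my_evaluator.py -- hypothetical evaluator; the signature Kaizen actually
# expects for source: custom_function may differ from this sketch.
def evaluate_response(output: dict) -> float:
    """Return a score between 0.0 and 1.0 for the agent's output."""
    score = 0.0
    if isinstance(output.get("sentiment_score"), float):
        score += 0.5  # output includes a numeric sentiment score
    if output.get("key_phrases"):
        score += 0.5  # output includes at least one key phrase
    return score
```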
Parallel Testing
For large test suites, you can enable parallel execution:
```yaml
settings:
  parallel: true
  max_workers: 4
```
Timeout Configuration
Set timeouts for long-running tests:
```yaml
settings:
  timeout: 180  # seconds
```
Troubleshooting
Common Issues
- Import Errors: Ensure your agent module can be imported correctly
- API Key Issues: Verify your `GOOGLE_API_KEY` is set and valid
- File Path Issues: Check that all file paths in your YAML are correct
- Evaluation Failures: Review your evaluation criteria for clarity
Debug Mode
Enable debug mode for more detailed output:
```bash
kaizen test-all --config kaizen.yaml --debug
```
Log Analysis
Use the log analyzer to understand test failures:
```bash
kaizen analyze-logs test-logs/latest_test.json
```
For more help, see our FAQ or join our Discord community.