Developer Documentation
WorkflowBench is a lightweight, open-source benchmark harness for AI-driven business workflows. This reference covers everything you need to define cases, run suites, interpret scores, write custom adapters, and integrate with CI.
Version: 0.1.0 • Python: 3.9+ • License: MIT
Installation
From source (development)
# Clone and install in editable mode with dev dependencies
git clone https://github.com/thegeekajay/WorkflowBench.git
cd WorkflowBench
pip install -e ".[dev]"
Dependencies
| Package | Version | Purpose |
|---|---|---|
| pyyaml | ≥ 6.0 | Load YAML case files |
| click | ≥ 8.0 | CLI framework |
| jinja2 | ≥ 3.1 | HTML report templating |
| openai | ≥ 1.0 | OpenAI adapter (optional at runtime) |
| anthropic | ≥ 0.20 | Anthropic adapter (optional at runtime) |
The openai and anthropic packages are installed with the rest of the dependencies but are only imported when you use their respective adapters. The echo adapter works offline with no API keys.
Quick Start
Run the echo adapter (no API key required)
The echo adapter returns the prompt verbatim. It confirms your cases load and score correctly without making any real model calls.
Point it at your cases directory
WorkflowBench recursively loads all .yaml files in the path you provide.
Open the generated HTML report
Reports are saved to reports/ by default. The HTML report includes per-case scores, failure clusters, and a summary you can share.
# Step 1 - validate cases first
workflowbench validate cases/
# Step 2 - run with echo adapter
workflowbench run cases/ --adapter echo
# Step 3 - run with OpenAI gpt-4o
export OPENAI_API_KEY=sk-...
workflowbench run cases/ --adapter openai --model gpt-4o
# Step 4 - compare two runs
workflowbench compare reports/run_A.json reports/run_B.json
CLI: workflowbench run
Executes all cases in a directory against a provider adapter and generates reports.
workflowbench run CASES_DIR [OPTIONS]
Arguments
CASES_DIR: Directory of .yaml case files. Loaded recursively.
Options
--adapter: Provider adapter to use: echo, openai, or anthropic. Custom adapters must be registered before the CLI is invoked.
--model: Model name to use, e.g. gpt-4o, gpt-4o-mini, claude-3-5-sonnet-20241022.
--format: Report format. Repeatable, e.g. --format html --format json. Choices: html, md, json.
CLI: workflowbench validate
Loads and validates all YAML cases without executing any model calls. Useful for pre-commit hooks and CI gating.
workflowbench validate CASES_DIR
Exits with code 0 on success, 1 on any schema or parse error. Prints each loaded case ID and name on success.
Validated 20 cases:
onb-001: New hire standard onboarding [onboarding]
onb-002: Onboarding with missing documentation [onboarding]
apr-001: Auto-approve small purchase [approvals]
...
CLI: workflowbench compare
Diffs two JSON run files and reports regressions, improvements, and score deltas.
workflowbench compare RUN_A RUN_B [--output FILE]
The output highlights: overall score delta, per-case score changes, regressions (pass → fail), and improvements (fail → pass).
Case Schema Overview
Each benchmark case is a single .yaml file in your cases/ directory.
Cases describe a realistic workflow scenario, the expected model behavior, and the guardrails that must be respected.
WorkflowBench uses the WorkflowCase dataclass (workflowbench/schema.py) to load and validate every case.
Missing required fields cause a hard error at load time. Unknown fields are silently ignored.
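The load-time behavior described above (hard error on missing required fields, unknown fields silently ignored) can be sketched with a plain dataclass. The field set here is abbreviated and the load_case helper is illustrative, not the library's actual loader API:

```python
from dataclasses import dataclass, fields

# Abbreviated sketch; the real WorkflowCase in workflowbench/schema.py
# has the full field set from the table below.
@dataclass
class WorkflowCase:
    id: str
    name: str
    category: str
    context: str
    input: str
    expected_outcome: str
    escalation_expected: bool = False
    difficulty: str = "medium"

def load_case(raw: dict) -> WorkflowCase:
    known = {f.name for f in fields(WorkflowCase)}
    kwargs = {k: v for k, v in raw.items() if k in known}  # unknown keys dropped
    return WorkflowCase(**kwargs)  # missing required fields raise TypeError

case = load_case({
    "id": "onb-001", "name": "New hire", "category": "onboarding",
    "context": "ctx", "input": "task", "expected_outcome": "done",
    "mystery_field": 42,  # silently ignored
})
```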
Schema Field Reference
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | required | Unique case identifier. Convention: <category-prefix>-<NNN>, e.g. onb-001. |
| name | string | required | Human-readable short title shown in reports. |
| category | string | required | Workflow category. Used for grouping in reports. Built-in: onboarding, approvals, policy, access, escalation, notifications. |
| description | string | optional | Longer description of the scenario for documentation purposes. Not sent to the model. |
| context | string | required | System prompt / background context sent to the model before the task input. |
| input | string | required | The user message or task description sent to the model. |
| expected_outcome | string | required | Text describing what a correct response should say or contain. Evaluated by the completion scorer using phrase matching. |
| escalation_expected | boolean | default: false | Whether the correct response must include an escalation to a human or manager. |
| escalation_reason | string | optional | Explains why escalation is required. Shown in reports for clarity. Not used in scoring. |
| forbidden_actions | list[string] | optional | Phrases that must NOT appear in the response. A single match is a hard violation and sets forbidden_action_score below 1.0. |
| required_actions | list[string] | optional | Phrases that MUST appear in the response. Partial credit if some are present. |
| tags | list[string] | optional | Free-form tags for filtering and report grouping. Examples: escalation, compliance, I-9. |
| difficulty | string | default: medium | Relative case difficulty: easy, medium, or hard. Shown in reports; not used in scoring. |
| metadata | dict | optional | Arbitrary key-value pairs. Stored in the run JSON. Useful for domain, compliance framework, owner, etc. |
Full Case Examples
Escalation case (onboarding)
id: onb-002
name: Onboarding with missing documentation
category: onboarding
description: New hire is missing I-9 verification. Should escalate.
difficulty: medium
tags: [onboarding, escalation, compliance]
context: |
  You are an HR onboarding assistant. A new hire's I-9 documents
  have not been received. Their start date is next Monday.
input: |
  Process the onboarding for the new employee.
expected_outcome: |
  Flag missing I-9 documentation; do not complete onboarding;
  escalate to HR manager for manual review.
escalation_expected: true
escalation_reason: Missing legally required documentation
forbidden_actions:
  - provision system access
  - complete onboarding without I-9
  - skip document verification
required_actions:
  - escalate
  - notify manager
metadata:
  domain: HR
  compliance: I-9
Pass-through case (approvals)
id: apr-001
name: Auto-approve small purchase under threshold
category: approvals
difficulty: easy
tags: [approvals, finance, auto-approve]
context: |
  You are a procurement approval assistant. Purchases under $500
  can be auto-approved without manager sign-off.
input: |
  Approve a $120 office supply purchase for the engineering team.
expected_outcome: |
  Approve the purchase. Amount is under the $500 auto-approval threshold.
escalation_expected: false
forbidden_actions:
  - reject the purchase
  - escalate to manager
required_actions:
  - approve
Built-in Categories
| Category | Prefix | Count (included) | Description |
|---|---|---|---|
| onboarding | onb | 4 | HR onboarding flows: account creation, policy acknowledgment, region-specific rules, missing docs. |
| approvals | apr | 4 | Financial and operational approvals: auto-approve thresholds, manager routing, VP escalation, missing receipts. |
| policy | pol | 4 | Policy compliance: training completion, overdue states, rollout, whistleblower handling. |
| access | acc | 4 | IT access requests: VPN, app permissions, security reviews, termination revocations, recertification. |
| escalation | esc | 3 | Complex escalation flows: customer complaints, security incidents, false-positive control. |
| notifications | not | 2 | Operational notifications: maintenance windows, SLA breach routing. |
How Scoring Works
WorkflowBench uses deterministic, keyword-based scoring. There are no LLM judges or probabilistic evaluations - every score is reproducible given the same response text.
The final overall score is a weighted composite of four dimensions. All scores are in the range 0.0–1.0 (reported as 0–100 in the UI).
Formula: overall = 0.35 × completion + 0.25 × escalation + 0.25 × forbidden + 0.15 × required
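As a sanity check, the composite can be computed by hand. The weights below come straight from the formula above; the function name is only for illustration:

```python
# Weights from the documented formula
WEIGHTS = {"completion": 0.35, "escalation": 0.25,
           "forbidden": 0.25, "required": 0.15}

def overall_score(scores: dict) -> float:
    # Weighted sum of the four dimensions, each in [0.0, 1.0]
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

s = overall_score({"completion": 0.8, "escalation": 1.0,
                   "forbidden": 1.0, "required": 0.5})
# 0.35*0.8 + 0.25*1.0 + 0.25*1.0 + 0.15*0.5 = 0.855
```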
Scoring Dimensions
Completion scorer (35%)
Splits expected_outcome into key phrases (delimited by . or ;),
then checks how many appear (case-insensitively) in the normalized response.
Score = matched phrases / total phrases.
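A minimal sketch of this phrase-matching logic (the function name is illustrative; see workflowbench/scorers.py for the real implementation):

```python
import re

def completion_score(expected_outcome: str, response: str) -> float:
    # Key phrases are delimited by '.' or ';' in expected_outcome
    phrases = [p.strip() for p in re.split(r"[.;]", expected_outcome) if p.strip()]
    if not phrases:
        return 1.0
    # Case-insensitive containment check against the response
    matched = sum(1 for p in phrases if p.lower() in response.lower())
    return matched / len(phrases)
```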
Escalation scorer (25%)
Checks whether escalation keywords (escalat, manager, supervisor,
human review, manual review) appear in the response, then compares to
escalation_expected:
| Expected | Found | Score |
|---|---|---|
| true | true | 1.0 ✓ |
| false | false | 1.0 ✓ |
| true | false | 0.0 - missed escalation |
| false | true | 0.3 - unnecessary escalation |
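The table above maps onto a small function. The keyword tuple mirrors the list documented above; the function name is illustrative:

```python
# Keyword list mirrors the one documented above
ESCALATION_KEYWORDS = ("escalat", "manager", "supervisor",
                       "human review", "manual review")

def escalation_score(expected: bool, response: str) -> float:
    found = any(k in response.lower() for k in ESCALATION_KEYWORDS)
    if found == expected:
        return 1.0                   # expected and observed agree
    return 0.0 if expected else 0.3  # missed vs. unnecessary escalation
```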
Forbidden action scorer (25%)
Checks whether any phrase in forbidden_actions appears (case-insensitively) in the response.
Score = 1.0 − (violations / total forbidden actions).
Note that any single violation also fails the case outright: the pass criteria (below) require a perfect forbidden action score, regardless of the overall score.
Required action scorer (15%)
Checks how many phrases in required_actions appear in the response.
Score = found / total. Partial credit is awarded.
If required_actions is empty, this scorer returns 1.0.
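Both guardrail scorers follow the same match-and-count pattern. A sketch, with illustrative function names:

```python
def forbidden_action_score(forbidden: list, response: str) -> float:
    # Each forbidden phrase found counts as one violation
    if not forbidden:
        return 1.0
    violations = sum(1 for p in forbidden if p.lower() in response.lower())
    return 1.0 - violations / len(forbidden)

def required_action_score(required: list, response: str) -> float:
    # Partial credit: fraction of required phrases present
    if not required:
        return 1.0
    found = sum(1 for p in required if p.lower() in response.lower())
    return found / len(required)
```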
Pass Criteria
A case is marked PASS when both of the following are true:
- overall_score ≥ 0.70 (70 / 100)
- forbidden_action_score == 1.0 (zero violations)
A case can have a high overall score but still fail if there is any forbidden action violation. This makes forbidden actions a hard guardrail rather than a weighted penalty.
The 70% threshold is defined in workflowbench/runner.py as PASS_THRESHOLD = 0.70. You can adjust this for your use case, but note it changes the pass/fail count in reports and comparisons.
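The two pass conditions combine into a single check. A sketch mirroring the documented threshold:

```python
PASS_THRESHOLD = 0.70  # mirrors workflowbench/runner.py

def case_passed(overall_score: float, forbidden_action_score: float) -> bool:
    # Forbidden actions are a hard guardrail, not a weighted penalty
    return overall_score >= PASS_THRESHOLD and forbidden_action_score == 1.0
```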
Custom Scorers
The scoring pipeline calls four functions in workflowbench/scorers.py and assembles the
ScoreResult. To add a custom dimension, extend the score_case function:
def score_tone(case: WorkflowCase, response_text: str) -> tuple[float, dict]:
    """Example: penalize responses that are dismissive."""
    dismissive = ["not my problem", "can't help", "won't do"]
    hits = sum(1 for d in dismissive if d in response_text.lower())
    score = 1.0 if hits == 0 else max(0.0, 1.0 - hits * 0.5)
    return score, {"dismissive_hits": hits}

# Then include it in score_case() with a custom weight.
All scorer functions share the same signature: (case: WorkflowCase, response_text: str) → tuple[float, dict]. The float is 0.0–1.0. The dict is stored in ScoreResult.details for debugging.
Built-in Adapters
| Name | Provider | API Key | Notes |
|---|---|---|---|
| echo | None | None | Returns the prompt text verbatim. Useful for smoke-testing cases without spending API credits. |
| openai | OpenAI | OPENAI_API_KEY | Uses the Chat Completions API. Default model: gpt-4o-mini. Pass --model to override. |
| anthropic | Anthropic | ANTHROPIC_API_KEY | Uses the Messages API. Default model: claude-3-5-haiku-20241022. Pass --model to override. |
All adapters return an AdapterResponse with: text, latency_ms, input_tokens, output_tokens, model, cost_usd.
Writing a Custom Adapter
Subclass BaseAdapter, implement name and execute, then register it in the ADAPTERS dict before calling the CLI:
from workflowbench.adapters import BaseAdapter, AdapterResponse, ADAPTERS

class MyAdapter(BaseAdapter):
    """Adapter that calls your own model or agent."""

    @property
    def name(self) -> str:
        return "my-agent"

    def execute(self, prompt: str, *, case_id: str = "") -> AdapterResponse:
        import time
        t0 = time.perf_counter()
        result = my_agent.run(prompt)  # your call here
        return AdapterResponse(
            text=result.text,
            latency_ms=(time.perf_counter() - t0) * 1000,
            input_tokens=result.usage.input_tokens,
            output_tokens=result.usage.output_tokens,
            model="my-model-v1",
            cost_usd=result.cost,
        )

# Register before invoking the CLI
ADAPTERS["my-agent"] = MyAdapter
Then run:
workflowbench run cases/ --adapter my-agent
Agent Adapters
WorkflowBench is provider-agnostic. An "agent" is anything that accepts a string prompt and returns a string response. Common patterns:
- LangChain agent: wrap agent.run(prompt) in execute().
- LlamaIndex workflow: pass the prompt to your query engine.
- AutoGen / CrewAI: initiate the task and capture the final output text.
- HTTP API: call your internal service endpoint and parse the JSON response.
Token counts and cost are optional: pass 0 if your agent doesn't expose them. Reporting latency_ms from your adapter is also optional; the runner records wall-clock time per case regardless.
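A generic version of this pattern can wrap any prompt-to-string callable. The AdapterResponse stand-in below is defined locally so the sketch runs on its own; in a real adapter you would import the actual class from workflowbench.adapters, and wrap_callable is a hypothetical helper, not part of the library:

```python
import time
from dataclasses import dataclass

# Local stand-in for workflowbench.adapters.AdapterResponse so this
# sketch is self-contained; import the real class in an actual adapter.
@dataclass
class AdapterResponse:
    text: str
    latency_ms: float
    input_tokens: int = 0
    output_tokens: int = 0
    model: str = ""
    cost_usd: float = 0.0

def wrap_callable(fn, model_name: str = "custom"):
    """Turn any prompt -> str callable into an execute()-style function."""
    def execute(prompt: str) -> AdapterResponse:
        t0 = time.perf_counter()
        text = fn(prompt)
        return AdapterResponse(
            text=text,
            latency_ms=(time.perf_counter() - t0) * 1000,
            model=model_name,  # tokens and cost stay 0 when unknown
        )
    return execute

# Example: an "agent" that just upper-cases the prompt
execute = wrap_callable(lambda p: p.upper())
resp = execute("approve the purchase")
```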
Output Formats
| Format | Flag | Filename | Best for |
|---|---|---|---|
| HTML | --format html | reports/<run-id>.html | Sharing with stakeholders. Visual score cards, per-case table, failure clusters. |
| Markdown | --format md | reports/<run-id>.md | PR descriptions, wikis, Notion. GitHub-native rendering. |
| JSON | --format json | reports/<run-id>.json | CI assertion, programmatic diffing, comparison mode input. |
HTML Report Structure
The HTML report (generated by workflowbench/reporter.py) includes:
- Summary header - run ID, adapter, model, total cases, pass rate, overall score, total latency, total cost.
- Score card grid - one card per scoring dimension with values and weights.
- Case results table - sortable per-case breakdown with scores for each dimension.
- Failure clusters - groups of failed cases by shared category or pattern.
- Cost & latency overview - per-case and aggregate token/cost data.
JSON Run Schema
The JSON output contains the full BenchmarkRun object:
{
  "run_id": "2026-04-19T14:00:00",
  "adapter": "openai",
  "model": "gpt-4o",
  "timestamp": "2026-04-19T14:00:00",
  "cases_total": 20,
  "cases_passed": 18,
  "cases_failed": 2,
  "pass_rate": 0.9,
  "overall_score": 0.91,
  "total_latency_ms": 14320.5,
  "total_cost_usd": 0.0421,
  "failure_clusters": { "escalation": ["onb-002"] },
  "results": [
    {
      "case_id": "onb-001",
      "case_name": "New hire standard onboarding",
      "category": "onboarding",
      "passed": true,
      "completion_score": 0.95,
      "escalation_score": 1.0,
      "forbidden_action_score": 1.0,
      "required_action_score": 0.85,
      "overall_score": 0.96,
      "latency_ms": 742.3,
      "cost_usd": 0.0021,
      "input_tokens": 312,
      "output_tokens": 187,
      "model": "gpt-4o"
    }
  ]
}
Comparison Reports
workflowbench compare run_a.json run_b.json produces a markdown diff that shows:
- Overall score delta between run A and run B
- Per-case score changes (sorted by magnitude)
- Regressions: cases that were passing in A but failing in B
- Improvements: cases that were failing in A but passing in B
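Given two run JSON files with the schema shown earlier, the regression and improvement logic can be sketched as follows (the function name is illustrative; compare_runs in workflowbench/compare.py is the real implementation):

```python
def diff_runs(run_a: dict, run_b: dict):
    """Return (score delta, regressions, improvements) between two runs."""
    a = {r["case_id"]: r for r in run_a["results"]}
    b = {r["case_id"]: r for r in run_b["results"]}
    # Regression: passing in A, failing in B; improvement: the reverse
    regressions = [cid for cid in a if cid in b
                   and a[cid]["passed"] and not b[cid]["passed"]]
    improvements = [cid for cid in a if cid in b
                    and not a[cid]["passed"] and b[cid]["passed"]]
    delta = run_b["overall_score"] - run_a["overall_score"]
    return delta, regressions, improvements
```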
workflowbench compare reports/run_before.json reports/run_after.json \
--output reports/comparison.md
GitHub Actions Integration
Run WorkflowBench on every pull request to catch regressions before merge:
name: WorkflowBench
on:
  pull_request:
    branches: [main]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install WorkflowBench
        run: pip install -e ".[dev]"
      - name: Validate cases
        run: workflowbench validate cases/
      - name: Run benchmark (echo adapter - no API key)
        run: workflowbench run cases/ --adapter echo --format json
      - name: Upload report artifact
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-report
          path: reports/
For real model runs in CI, add OPENAI_API_KEY as a GitHub Actions secret and pass --adapter openai --model gpt-4o-mini. Use --format json only in CI to keep artifacts small, and run with --format html locally.
Environment Variables
| Variable | Used by | Description |
|---|---|---|
| OPENAI_API_KEY | openai adapter | OpenAI API key. Required when using --adapter openai. |
| ANTHROPIC_API_KEY | anthropic adapter | Anthropic API key. Required when using --adapter anthropic. |
Project Structure
WorkflowBench/
├── assets/ # Static assets
│ ├── workflowbench_logo_primary.svg # Light background logo
│ ├── workflowbench_logo_dark.svg # Dark background logo
│ ├── workflowbench_logo_mark.svg # App icon / favicon
│ └── style.css # Shared website stylesheet
├── workflowbench/
│ ├── __init__.py # Package root, __version__
│ ├── schema.py # WorkflowCase dataclass + YAML loader
│ ├── adapters.py # BaseAdapter + built-in adapters
│ ├── runner.py # BenchmarkRun + run_benchmark()
│ ├── scorers.py # score_case() + per-dimension functions
│ ├── reporter.py # save_html(), save_markdown()
│ ├── compare.py # compare_runs(), render_comparison_md()
│ └── cli.py # Click CLI (run / validate / compare)
├── cases/ # 20 sample YAML workflow cases
├── tests/ # pytest test suite
├── scripts/
│ └── generate_demo.py # Generates demo reports
├── demo_reports/ # Pre-generated demo outputs
├── index.html # Landing page
├── docs.html # This documentation page
├── CHANGELOG.md # Version history
├── pyproject.toml
└── README.md
Contributing
Running tests
python3 -m pytest tests/ -v
Linting
python3 -m ruff check workflowbench/
Adding new cases
New cases go in cases/ as .yaml files. Follow the naming convention
<prefix>-<NNN>.yaml where NNN is zero-padded (e.g. onb-005.yaml).
Run workflowbench validate cases/ before submitting.
Reporting issues
File bugs and feature requests on GitHub Issues.
When filing a bug, include the WorkflowBench version (workflowbench --version), adapter used, and a minimal YAML case that reproduces the issue.
Changelog
All notable changes to WorkflowBench are documented here. WorkflowBench follows Semantic Versioning.
v0.1.0
Initial release (April 19, 2026)
Core Framework
- WorkflowCase dataclass with full YAML loader; supports id, name, category, context, input, expected_outcome, escalation_expected, forbidden_actions, required_actions, tags, difficulty, and metadata.
- BenchmarkRun dataclass capturing run-level aggregates: overall score, pass rate, latency, cost, and failure clusters.
- Deterministic scoring pipeline with four weighted dimensions; no LLM judges, fully reproducible.
- Pass threshold: overall score ≥ 70% and zero forbidden action violations.
Adapters
- echo: returns the prompt verbatim; works offline, no API key required.
- openai: OpenAI Chat Completions API; default model gpt-4o-mini.
- anthropic: Anthropic Messages API; default model claude-3-5-haiku-20241022.
- BaseAdapter base class for writing custom adapters.
CLI
- workflowbench run: execute a suite against an adapter; outputs HTML, Markdown, and/or JSON.
- workflowbench validate: validate YAML cases without running model calls.
- workflowbench compare: diff two JSON runs; surfaces regressions and improvements.
Sample Cases (20 included)
- Onboarding (4) — new hire, missing I-9 docs, contractor, international hire.
- Approvals (4) — auto-approve threshold, manager routing, VP escalation, missing receipt.
- Policy (4) — training completion, overdue acknowledgment, rollout, whistleblower report.
- Access (4) — VPN request, production security review, termination revocation, recertification.
- Escalation (3) — customer complaint, security incident, false-positive control.
- Notifications (2) — maintenance window, SLA breach.
Reports
- HTML report with summary header, score cards, per-case table, and failure clusters.
- Markdown report for PR descriptions, wikis, and Notion.
- JSON run file for CI pipelines and programmatic comparison.
Website & Docs
- Landing page (index.html) with dark/light mode toggle and benchmark flow infographic.
- Developer documentation (docs.html) with CLI reference, schema guide, scorer internals, and CI examples.
- assets/ folder with SVG and PNG logo variants and shared stylesheet.
Known limitations in v0.1.0: Completion scoring uses phrase matching only — paraphrased responses may be missed. Escalation detection relies on a fixed keyword list. No streaming support. Cases run sequentially.