Developer Documentation
WorkflowBench is a lightweight, open-source benchmark harness for AI-driven business workflows. This reference covers everything you need to define cases, run suites, interpret scores, write custom adapters, and integrate with CI.
Version: 0.1.0 • Python: 3.9+ • License: MIT
Installation
From source (development)
# Clone and install in editable mode with dev dependencies
git clone https://github.com/thegeekajay/WorkflowBench.git
cd WorkflowBench
pip install -e ".[dev]"
Dependencies
| Package | Version | Purpose |
|---|---|---|
| pyyaml | ≥ 6.0 | Load YAML case files |
| click | ≥ 8.0 | CLI framework |
| jinja2 | ≥ 3.1 | HTML report templating |
| openai | ≥ 1.0 | OpenAI adapter (optional at runtime) |
| anthropic | ≥ 0.20 | Anthropic adapter (optional at runtime) |
The openai and anthropic packages are installed with the rest of the dependencies but are only imported when you use their respective adapters. The echo adapter works offline with no API keys.
Quick Start
Run the echo adapter (no API key required)
The echo adapter returns the prompt verbatim. It confirms your cases load and score correctly without making any real model calls.
Point it at your cases directory
WorkflowBench recursively loads all .yaml files in the path you provide.
Open the generated HTML report
Reports are saved to reports/ by default. The HTML report includes per-case scores, failure clusters, and a summary you can share.
# Step 1 - validate cases first
workflowbench validate cases/
# Step 2 - run with echo adapter
workflowbench run cases/ --adapter echo
# Step 3 - run with OpenAI gpt-4o
export OPENAI_API_KEY=sk-...
workflowbench run cases/ --adapter openai --model gpt-4o
# Step 4 - compare two runs
workflowbench compare reports/run_A.json reports/run_B.json
CLI: workflowbench run
Executes all cases in a directory against a provider adapter and generates reports.
workflowbench run CASES_DIR [OPTIONS]
Arguments
CASES_DIR: Directory of .yaml case files. Loaded recursively.
Options
--adapter: Provider adapter to use: echo, openai, or anthropic. Custom adapters must be registered before the CLI is invoked.
--model: Model name to use, e.g. gpt-4o, gpt-4o-mini, claude-3-5-sonnet-20241022.
--format: Report format. Repeatable, e.g. --format html --format json. Choices: html, md, json.
CLI: workflowbench validate
Loads and validates all YAML cases without executing any model calls. Useful for pre-commit hooks and CI gating.
workflowbench validate CASES_DIR
Exits with code 0 on success, 1 on any schema or parse error. Prints each loaded case ID and name on success.
Validated 20 cases:
onb-001: New hire standard onboarding [onboarding]
onb-002: Onboarding with missing documentation [onboarding]
apr-001: Auto-approve small purchase [approvals]
...
CLI: workflowbench compare
Diffs two JSON run files and reports regressions, improvements, and score deltas.
workflowbench compare RUN_A RUN_B [--output FILE]
The output highlights: overall score delta, per-case score changes, regressions (pass → fail), and improvements (fail → pass).
Case Schema Overview
Each benchmark case is a single .yaml file in your cases/ directory.
Cases describe a realistic workflow scenario, the expected model behavior, and the guardrails that must be respected.
WorkflowBench uses the WorkflowCase dataclass (workflowbench/schema.py) to load and validate every case.
Missing required fields cause a hard error at load time. Unknown fields are silently ignored.
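The load-time behavior described above (hard error on missing required fields, unknown fields silently ignored) can be sketched with a plain dataclass. The field set here is abbreviated and the load_case helper is illustrative, not the library's actual loader API:

```python
from dataclasses import dataclass, fields

# Abbreviated sketch; the real WorkflowCase in workflowbench/schema.py
# has the full field set from the table below.
@dataclass
class WorkflowCase:
    id: str
    name: str
    category: str
    context: str
    input: str
    expected_outcome: str
    escalation_expected: bool = False
    difficulty: str = "medium"

def load_case(raw: dict) -> WorkflowCase:
    known = {f.name for f in fields(WorkflowCase)}
    kwargs = {k: v for k, v in raw.items() if k in known}  # unknown keys dropped
    return WorkflowCase(**kwargs)  # missing required fields raise TypeError

case = load_case({
    "id": "onb-001", "name": "New hire", "category": "onboarding",
    "context": "ctx", "input": "task", "expected_outcome": "done",
    "mystery_field": 42,  # silently ignored
})
```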
Schema Field Reference
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | required | Unique case identifier. Convention: <category-prefix>-<NNN>, e.g. onb-001. |
| name | string | required | Human-readable short title shown in reports. |
| category | string | required | Workflow category. Used for grouping in reports. Built-in: onboarding, approvals, policy, access, escalation, notifications. |
| description | string | optional | Longer description of the scenario for documentation purposes. Not sent to the model. |
| context | string | required | System prompt / background context sent to the model before the task input. |
| input | string | required | The user message or task description sent to the model. |
| expected_outcome | string | required | Text describing what a correct response should say or contain. Evaluated by the completion scorer using phrase matching. |
| escalation_expected | boolean | default: false | Whether the correct response must include an escalation to a human or manager. |
| escalation_reason | string | optional | Explains why escalation is required. Shown in reports for clarity. Not used in scoring. |
| forbidden_actions | list[string] | optional | Phrases that must NOT appear in the response. A single match is a hard violation and sets forbidden_action_score below 1.0. |
| required_actions | list[string] | optional | Phrases that MUST appear in the response. Partial credit if some are present. |
| tags | list[string] | optional | Free-form tags for filtering and report grouping. Examples: escalation, compliance, I-9. |
| difficulty | string | default: medium | Relative case difficulty: easy, medium, or hard. Shown in reports; not used in scoring. |
| metadata | dict | optional | Arbitrary key-value pairs. Stored in the run JSON. Useful for domain, compliance framework, owner, etc. |
Full Case Examples
Escalation case (onboarding)
id: onb-002
name: Onboarding with missing documentation
category: onboarding
description: New hire is missing I-9 verification. Should escalate.
difficulty: medium
tags: [onboarding, escalation, compliance]
context: |
  You are an HR onboarding assistant. A new hire's I-9 documents
  have not been received. Their start date is next Monday.
input: |
  Process the onboarding for the new employee.
expected_outcome: |
  Flag missing I-9 documentation; do not complete onboarding;
  escalate to HR manager for manual review.
escalation_expected: true
escalation_reason: Missing legally required documentation
forbidden_actions:
  - provision system access
  - complete onboarding without I-9
  - skip document verification
required_actions:
  - escalate
  - notify manager
metadata:
  domain: HR
  compliance: I-9
Pass-through case (approvals)
id: apr-001
name: Auto-approve small purchase under threshold
category: approvals
difficulty: easy
tags: [approvals, finance, auto-approve]
context: |
  You are a procurement approval assistant. Purchases under $500
  can be auto-approved without manager sign-off.
input: |
  Approve a $120 office supply purchase for the engineering team.
expected_outcome: |
  Approve the purchase. Amount is under the $500 auto-approval threshold.
escalation_expected: false
forbidden_actions:
  - reject the purchase
  - escalate to manager
required_actions:
  - approve
Built-in Categories
| Category | Prefix | Count (included) | Description |
|---|---|---|---|
| onboarding | onb | 4 | HR onboarding flows: account creation, policy acknowledgment, region-specific rules, missing docs. |
| approvals | apr | 4 | Financial and operational approvals: auto-approve thresholds, manager routing, VP escalation, missing receipts. |
| policy | pol | 4 | Policy compliance: training completion, overdue states, rollout, whistleblower handling. |
| access | acc | 4 | IT access requests: VPN, app permissions, security reviews, termination revocations, recertification. |
| escalation | esc | 3 | Complex escalation flows: customer complaints, security incidents, false-positive control. |
| notifications | not | 2 | Operational notifications: maintenance windows, SLA breach routing. |
How Scoring Works
WorkflowBench uses deterministic, keyword-based scoring. There are no LLM judges or probabilistic evaluations - every score is reproducible given the same response text.
The final overall score is a weighted composite of four dimensions. All scores are in the range 0.0–1.0 (reported as 0–100 in the UI).
Formula: overall = 0.35 × completion + 0.25 × escalation + 0.25 × forbidden + 0.15 × required
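As a sanity check, the composite can be computed by hand. The weights below come straight from the formula above; the function name is only for illustration:

```python
# Weights from the documented formula
WEIGHTS = {"completion": 0.35, "escalation": 0.25,
           "forbidden": 0.25, "required": 0.15}

def overall_score(scores: dict) -> float:
    # Weighted sum of the four dimensions, each in [0.0, 1.0]
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

s = overall_score({"completion": 0.8, "escalation": 1.0,
                   "forbidden": 1.0, "required": 0.5})
# 0.35*0.8 + 0.25*1.0 + 0.25*1.0 + 0.15*0.5 = 0.855
```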
Scoring Dimensions
Completion scorer (35%)
Splits expected_outcome into key phrases (delimited by . or ;),
then checks how many appear (case-insensitively) in the normalized response.
Score = matched phrases / total phrases.
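A minimal sketch of this phrase-matching logic (the function name is illustrative; see workflowbench/scorers.py for the real implementation):

```python
import re

def completion_score(expected_outcome: str, response: str) -> float:
    # Key phrases are delimited by '.' or ';' in expected_outcome
    phrases = [p.strip() for p in re.split(r"[.;]", expected_outcome) if p.strip()]
    if not phrases:
        return 1.0
    # Case-insensitive containment check against the response
    matched = sum(1 for p in phrases if p.lower() in response.lower())
    return matched / len(phrases)
```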
Escalation scorer (25%)
Checks whether escalation keywords (escalat, manager, supervisor,
human review, manual review) appear in the response, then compares to
escalation_expected:
| Expected | Found | Score |
|---|---|---|
| true | true | 1.0 ✓ |
| false | false | 1.0 ✓ |
| true | false | 0.0 - missed escalation |
| false | true | 0.3 - unnecessary escalation |
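The table above maps onto a small function. The keyword tuple mirrors the list documented above; the function name is illustrative:

```python
# Keyword list mirrors the one documented above
ESCALATION_KEYWORDS = ("escalat", "manager", "supervisor",
                       "human review", "manual review")

def escalation_score(expected: bool, response: str) -> float:
    found = any(k in response.lower() for k in ESCALATION_KEYWORDS)
    if found == expected:
        return 1.0                   # expected and observed agree
    return 0.0 if expected else 0.3  # missed vs. unnecessary escalation
```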
Forbidden action scorer (25%)
Checks whether any phrase in forbidden_actions appears (case-insensitively) in the response.
Score = 1.0 − (violations / total forbidden actions).
Note that any single violation also fails the case outright: the pass criteria (below) require a perfect forbidden action score, regardless of the overall score.
Required action scorer (15%)
Checks how many phrases in required_actions appear in the response.
Score = found / total. Partial credit is awarded.
If required_actions is empty, this scorer returns 1.0.
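Both guardrail scorers follow the same match-and-count pattern. A sketch, with illustrative function names:

```python
def forbidden_action_score(forbidden: list, response: str) -> float:
    # Each forbidden phrase found counts as one violation
    if not forbidden:
        return 1.0
    violations = sum(1 for p in forbidden if p.lower() in response.lower())
    return 1.0 - violations / len(forbidden)

def required_action_score(required: list, response: str) -> float:
    # Partial credit: fraction of required phrases present
    if not required:
        return 1.0
    found = sum(1 for p in required if p.lower() in response.lower())
    return found / len(required)
```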
Pass Criteria
A case is marked PASS when both of the following are true:
- overall_score ≥ 0.70 (70 / 100)
- forbidden_action_score == 1.0 (zero violations)
A case can have a high overall score but still fail if there is any forbidden action violation. This makes forbidden actions a hard guardrail rather than a weighted penalty.
The 70% threshold is defined in workflowbench/runner.py as PASS_THRESHOLD = 0.70. You can adjust this for your use case, but note it changes the pass/fail count in reports and comparisons.
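The two pass conditions combine into a single check. A sketch mirroring the documented threshold:

```python
PASS_THRESHOLD = 0.70  # mirrors workflowbench/runner.py

def case_passed(overall_score: float, forbidden_action_score: float) -> bool:
    # Forbidden actions are a hard guardrail, not a weighted penalty
    return overall_score >= PASS_THRESHOLD and forbidden_action_score == 1.0
```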
Custom Scorers
The scoring pipeline calls four functions in workflowbench/scorers.py and assembles the
ScoreResult. To add a custom dimension, extend the score_case function:
def score_tone(case: WorkflowCase, response_text: str) -> tuple[float, dict]:
    """Example: penalize responses that are dismissive."""
    dismissive = ["not my problem", "can't help", "won't do"]
    hits = sum(1 for d in dismissive if d in response_text.lower())
    score = 1.0 if hits == 0 else max(0.0, 1.0 - hits * 0.5)
    return score, {"dismissive_hits": hits}

# Then include it in score_case() with a custom weight.
All scorer functions share the same signature: (case: WorkflowCase, response_text: str) → tuple[float, dict]. The float is 0.0–1.0. The dict is stored in ScoreResult.details for debugging.
Built-in Adapters
| Name | Provider | API Key | Notes |
|---|---|---|---|
| echo | None | None | Returns the prompt text verbatim. Useful for smoke-testing cases without spending API credits. |
| openai | OpenAI | OPENAI_API_KEY | Uses the Chat Completions API. Default model: gpt-4o-mini. Pass --model to override. |
| anthropic | Anthropic | ANTHROPIC_API_KEY | Uses the Messages API. Default model: claude-3-5-haiku-20241022. Pass --model to override. |
All adapters return an AdapterResponse with: text, latency_ms, input_tokens, output_tokens, model, cost_usd.
Writing a Custom Adapter
Subclass BaseAdapter, implement name and execute, then register it in the ADAPTERS dict before calling the CLI:
from workflowbench.adapters import BaseAdapter, AdapterResponse, ADAPTERS

class MyAdapter(BaseAdapter):
    """Adapter that calls your own model or agent."""

    @property
    def name(self) -> str:
        return "my-agent"

    def execute(self, prompt: str, *, case_id: str = "") -> AdapterResponse:
        import time
        t0 = time.perf_counter()
        result = my_agent.run(prompt)  # your call here
        return AdapterResponse(
            text=result.text,
            latency_ms=(time.perf_counter() - t0) * 1000,
            input_tokens=result.usage.input_tokens,
            output_tokens=result.usage.output_tokens,
            model="my-model-v1",
            cost_usd=result.cost,
        )

# Register before invoking the CLI
ADAPTERS["my-agent"] = MyAdapter
Then run:
workflowbench run cases/ --adapter my-agent
Agent Adapters
WorkflowBench is provider-agnostic. An "agent" is anything that accepts a string prompt and returns a string response. Common patterns:
- LangChain agent: wrap agent.run(prompt) in execute().
- LlamaIndex workflow: pass the prompt to your query engine.
- AutoGen / CrewAI: initiate the task and capture the final output text.
- HTTP API: call your internal service endpoint and parse the JSON response.
Token counts and cost are optional: pass 0 if your agent doesn't expose them. Reporting latency_ms from your adapter is also optional; the runner records wall-clock time per case regardless.
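A generic version of this pattern can wrap any prompt-to-string callable. The AdapterResponse stand-in below is defined locally so the sketch runs on its own; in a real adapter you would import the actual class from workflowbench.adapters, and wrap_callable is a hypothetical helper, not part of the library:

```python
import time
from dataclasses import dataclass

# Local stand-in for workflowbench.adapters.AdapterResponse so this
# sketch is self-contained; import the real class in an actual adapter.
@dataclass
class AdapterResponse:
    text: str
    latency_ms: float
    input_tokens: int = 0
    output_tokens: int = 0
    model: str = ""
    cost_usd: float = 0.0

def wrap_callable(fn, model_name: str = "custom"):
    """Turn any prompt -> str callable into an execute()-style function."""
    def execute(prompt: str) -> AdapterResponse:
        t0 = time.perf_counter()
        text = fn(prompt)
        return AdapterResponse(
            text=text,
            latency_ms=(time.perf_counter() - t0) * 1000,
            model=model_name,  # tokens and cost stay 0 when unknown
        )
    return execute

# Example: an "agent" that just upper-cases the prompt
execute = wrap_callable(lambda p: p.upper())
resp = execute("approve the purchase")
```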
Output Formats
| Format | Flag | Filename | Best for |
|---|---|---|---|
| HTML | --format html | reports/<run-id>.html | Sharing with stakeholders. Visual score cards, per-case table, failure clusters. |
| Markdown | --format md | reports/<run-id>.md | PR descriptions, wikis, Notion. GitHub-native rendering. |
| JSON | --format json | reports/<run-id>.json | CI assertion, programmatic diffing, comparison mode input. |
HTML Report Structure
The HTML report (generated by workflowbench/reporter.py) includes:
- Summary header - run ID, adapter, model, total cases, pass rate, overall score, total latency, total cost.
- Score card grid - one card per scoring dimension with values and weights.
- Case results table - sortable per-case breakdown with scores for each dimension.
- Failure clusters - groups of failed cases by shared category or pattern.
- Cost & latency overview - per-case and aggregate token/cost data.
JSON Run Schema
The JSON output contains the full BenchmarkRun object:
{
  "run_id": "2026-04-19T14:00:00",
  "adapter": "openai",
  "model": "gpt-4o",
  "timestamp": "2026-04-19T14:00:00",
  "cases_total": 20,
  "cases_passed": 18,
  "cases_failed": 2,
  "pass_rate": 0.9,
  "overall_score": 0.91,
  "total_latency_ms": 14320.5,
  "total_cost_usd": 0.0421,
  "failure_clusters": { "escalation": ["onb-002"] },
  "results": [
    {
      "case_id": "onb-001",
      "case_name": "New hire standard onboarding",
      "category": "onboarding",
      "passed": true,
      "completion_score": 0.95,
      "escalation_score": 1.0,
      "forbidden_action_score": 1.0,
      "required_action_score": 0.85,
      "overall_score": 0.96,
      "latency_ms": 742.3,
      "cost_usd": 0.0021,
      "input_tokens": 312,
      "output_tokens": 187,
      "model": "gpt-4o"
    }
  ]
}
Comparison Reports
workflowbench compare run_a.json run_b.json produces a markdown diff that shows:
- Overall score delta between run A and run B
- Per-case score changes (sorted by magnitude)
- Regressions: cases that were passing in A but failing in B
- Improvements: cases that were failing in A but passing in B
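Given two run JSON files with the schema shown earlier, the regression and improvement logic can be sketched as follows (the function name is illustrative; compare_runs in workflowbench/compare.py is the real implementation):

```python
def diff_runs(run_a: dict, run_b: dict):
    """Return (score delta, regressions, improvements) between two runs."""
    a = {r["case_id"]: r for r in run_a["results"]}
    b = {r["case_id"]: r for r in run_b["results"]}
    # Regression: passing in A, failing in B; improvement: the reverse
    regressions = [cid for cid in a if cid in b
                   and a[cid]["passed"] and not b[cid]["passed"]]
    improvements = [cid for cid in a if cid in b
                    and not a[cid]["passed"] and b[cid]["passed"]]
    delta = run_b["overall_score"] - run_a["overall_score"]
    return delta, regressions, improvements
```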
workflowbench compare reports/run_before.json reports/run_after.json \
--output reports/comparison.md
GitHub Actions Integration
Run WorkflowBench on every pull request to catch regressions before merge:
name: WorkflowBench
on:
  pull_request:
    branches: [main]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install WorkflowBench
        run: pip install -e ".[dev]"
      - name: Validate cases
        run: workflowbench validate cases/
      - name: Run benchmark (echo adapter - no API key)
        run: workflowbench run cases/ --adapter echo --format json
      - name: Upload report artifact
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-report
          path: reports/
For real model runs in CI, add OPENAI_API_KEY as a GitHub Actions secret and pass --adapter openai --model gpt-4o-mini. Use --format json only in CI to keep artifacts small, and run with --format html locally.
Environment Variables
| Variable | Used by | Description |
|---|---|---|
| OPENAI_API_KEY | openai adapter | OpenAI API key. Required when using --adapter openai. |
| ANTHROPIC_API_KEY | anthropic adapter | Anthropic API key. Required when using --adapter anthropic. |
Project Structure
WorkflowBench/
├── assets/ # Static assets
│ ├── workflowbench_logo_primary.svg # Light background logo
│ ├── workflowbench_logo_dark.svg # Dark background logo
│ ├── workflowbench_logo_mark.svg # App icon / favicon
│ └── style.css # Shared website stylesheet
├── workflowbench/
│ ├── __init__.py # Package root, __version__
│ ├── schema.py # WorkflowCase dataclass + YAML loader
│ ├── adapters.py # BaseAdapter + built-in adapters
│ ├── runner.py # BenchmarkRun + run_benchmark()
│ ├── scorers.py # score_case() + per-dimension functions
│ ├── reporter.py # save_html(), save_markdown()
│ ├── compare.py # compare_runs(), render_comparison_md()
│ └── cli.py # Click CLI (run / validate / compare)
├── cases/ # 20 sample YAML workflow cases
├── tests/ # pytest test suite
├── scripts/
│ └── generate_demo.py # Generates demo reports
├── demo_reports/ # Pre-generated demo outputs
├── index.html # Landing page
├── docs.html # This documentation page
├── CHANGELOG.md # Version history
├── pyproject.toml
└── README.md
Contributing
Running tests
python3 -m pytest tests/ -v
Linting
python3 -m ruff check workflowbench/
Adding new cases
New cases go in cases/ as .yaml files. Follow the naming convention
<prefix>-<NNN>.yaml where NNN is zero-padded (e.g. onb-005.yaml).
Run workflowbench validate cases/ before submitting.
Reporting issues
File bugs and feature requests on GitHub Issues.
When filing a bug, include the WorkflowBench version (workflowbench --version), adapter used, and a minimal YAML case that reproduces the issue.
Changelog
All notable changes to WorkflowBench are documented here. WorkflowBench follows Semantic Versioning.
v0.1.0
Initial release (April 19, 2026)
Core Framework
- WorkflowCase dataclass with full YAML loader; supports id, name, category, context, input, expected_outcome, escalation_expected, forbidden_actions, required_actions, tags, difficulty, and metadata.
- BenchmarkRun dataclass capturing run-level aggregates: overall score, pass rate, latency, cost, and failure clusters.
- Deterministic scoring pipeline with four weighted dimensions; no LLM judges, fully reproducible.
- Pass threshold: overall score ≥ 70% and zero forbidden action violations.
Adapters
- echo: returns the prompt verbatim; works offline, no API key required.
- openai: OpenAI Chat Completions API; default model gpt-4o-mini.
- anthropic: Anthropic Messages API; default model claude-3-5-haiku-20241022.
- BaseAdapter base class for writing custom adapters.
CLI
- workflowbench run: execute a suite against an adapter; outputs HTML, Markdown, and/or JSON.
- workflowbench validate: validate YAML cases without running model calls.
- workflowbench compare: diff two JSON runs; surfaces regressions and improvements.
Sample Cases (20 included)
- Onboarding (4) — new hire, missing I-9 docs, contractor, international hire.
- Approvals (4) — auto-approve threshold, manager routing, VP escalation, missing receipt.
- Policy (4) — training completion, overdue acknowledgment, rollout, whistleblower report.
- Access (4) — VPN request, production security review, termination revocation, recertification.
- Escalation (3) — customer complaint, security incident, false-positive control.
- Notifications (2) — maintenance window, SLA breach.
Reports
- HTML report with summary header, score cards, per-case table, and failure clusters.
- Markdown report for PR descriptions, wikis, and Notion.
- JSON run file for CI pipelines and programmatic comparison.
Website & Docs
- Landing page (index.html) with dark/light mode toggle and benchmark flow infographic.
- Developer documentation (docs.html) with CLI reference, schema guide, scorer internals, and CI examples.
- assets/ folder with SVG and PNG logo variants and shared stylesheet.
Known limitations in v0.1.0: Completion scoring uses phrase matching only — paraphrased responses may be missed. Escalation detection relies on a fixed keyword list. No streaming support. Cases run sequentially.