WorkflowBench Docs

Developer Documentation

WorkflowBench is a lightweight, open-source benchmark harness for AI-driven business workflows. This reference covers everything you need to define cases, run suites, interpret scores, write custom adapters, and integrate with CI.

Version: 0.1.0  •  Python: 3.9+  •  License: MIT

Installation

From source (development)

shell
# Clone and install in editable mode with dev dependencies
git clone https://github.com/thegeekajay/WorkflowBench.git
cd WorkflowBench
pip install -e ".[dev]"

Dependencies

Package     Version   Purpose
pyyaml      ≥ 6.0     Load YAML case files
click       ≥ 8.0     CLI framework
jinja2      ≥ 3.1     HTML report templating
openai      ≥ 1.0     OpenAI adapter (optional at runtime)
anthropic   ≥ 0.20    Anthropic adapter (optional at runtime)

The openai and anthropic packages are installed as dependencies but are only imported when you use their respective adapters. The echo adapter works offline with no API keys.

Quick Start

Step 1: Run the echo adapter (no API key required)

The echo adapter returns the prompt verbatim. It confirms your cases load and score correctly without making any real model calls.

Step 2: Point it at your cases directory

WorkflowBench recursively loads all .yaml files in the path you provide.

Step 3: Open the generated HTML report

Reports are saved to reports/ by default. The HTML report includes per-case scores, failure clusters, and a summary you can share.

shell
# Step 1 - validate cases first
workflowbench validate cases/

# Step 2 - run with echo adapter
workflowbench run cases/ --adapter echo

# Step 3 - run with OpenAI gpt-4o
export OPENAI_API_KEY=sk-...
workflowbench run cases/ --adapter openai --model gpt-4o

# Step 4 - compare two runs
workflowbench compare reports/run_A.json reports/run_B.json

CLI: workflowbench run

Executes all cases in a directory against a provider adapter and generates reports.

usage
workflowbench run CASES_DIR [OPTIONS]

Arguments

CASES_DIR
required
Path to a directory (or single file) containing .yaml case files. Loaded recursively.

Options

--adapter, -a
default: echo
Adapter name: echo, openai, or anthropic. Custom adapters must be registered before the CLI is invoked.
--model, -m
optional
Model name to pass to the adapter. Examples: gpt-4o, gpt-4o-mini, claude-3-5-sonnet-20241022.
--output, -o
default: reports
Directory where reports are written. Created if it does not exist.
--run-id
optional
Custom string identifier for this run. Defaults to a timestamp. Used in report filenames and JSON.
--format
default: html md json
One or more output formats. Repeatable: --format html --format json. Choices: html, md, json.

CLI: workflowbench validate

Loads and validates all YAML cases without executing any model calls. Useful for pre-commit hooks and CI gating.

usage
workflowbench validate CASES_DIR

Exits with code 0 on success, 1 on any schema or parse error. Prints each loaded case ID and name on success.

output example
Validated 20 cases:
  onb-001: New hire standard onboarding [onboarding]
  onb-002: Onboarding with missing documentation [onboarding]
  apr-001: Auto-approve small purchase [approvals]
  ...

CLI: workflowbench compare

Diffs two JSON run files and reports regressions, improvements, and score deltas.

usage
workflowbench compare RUN_A RUN_B [--output FILE]
RUN_A
required
Path to the baseline JSON run file.
RUN_B
required
Path to the candidate JSON run file to compare against the baseline.
--output
optional
Write the markdown comparison report to a file instead of stdout.

The output highlights: overall score delta, per-case score changes, regressions (pass → fail), and improvements (fail → pass).


Case Schema Overview

Each benchmark case is a single .yaml file in your cases/ directory. Cases describe a realistic workflow scenario, the expected model behavior, and the guardrails that must be respected.

WorkflowBench uses the WorkflowCase dataclass (workflowbench/schema.py) to load and validate every case. Missing required fields cause a hard error at load time. Unknown fields are silently ignored.
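The loading behavior can be sketched with a plain dataclass. This is a simplified sketch covering only a subset of fields; the real WorkflowCase in workflowbench/schema.py may differ in its details:

```python
from dataclasses import MISSING, dataclass, field, fields

@dataclass
class WorkflowCase:
    """Simplified sketch; see workflowbench/schema.py for the real class."""
    id: str
    name: str
    category: str
    context: str
    input: str
    expected_outcome: str
    escalation_expected: bool = False
    forbidden_actions: list = field(default_factory=list)
    required_actions: list = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: dict) -> "WorkflowCase":
        declared = {f.name for f in fields(cls)}
        # Missing required fields (those without defaults) are a hard error.
        missing = [f.name for f in fields(cls)
                   if f.default is MISSING and f.default_factory is MISSING
                   and f.name not in raw]
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        # Unknown fields are silently ignored.
        return cls(**{k: v for k, v in raw.items() if k in declared})
```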


Schema Field Reference

id
string, required
Unique case identifier. Convention: <category-prefix>-<NNN>, e.g. onb-001.
name
string, required
Human-readable short title shown in reports.
category
string, required
Workflow category. Used for grouping in reports. Built-in: onboarding, approvals, policy, access, escalation, notifications.
description
string, optional
Longer description of the scenario for documentation purposes. Not sent to the model.
context
string, required
System prompt / background context sent to the model before the task input.
input
string, required
The user message or task description sent to the model.
expected_outcome
string, required
Text describing what a correct response should say or contain. Evaluated by the completion scorer using phrase matching.
escalation_expected
boolean, default: false
Whether the correct response must include an escalation to a human or manager.
escalation_reason
string, optional
Explains why escalation is required. Shown in reports for clarity. Not used in scoring.
forbidden_actions
list[string], optional
Phrases that must NOT appear in the response. A single match is a hard violation and sets forbidden_action_score below 1.0.
required_actions
list[string], optional
Phrases that MUST appear in the response. Partial credit if some are present.
tags
list[string], optional
Free-form tags for filtering and report grouping. Examples: escalation, compliance, I-9.
difficulty
string, default: medium
Relative case difficulty: easy, medium, hard. Shown in reports; not used in scoring.
metadata
dict, optional
Arbitrary key-value pairs. Stored in the run JSON. Useful for domain, compliance framework, owner, etc.

Full Case Examples

Escalation case (onboarding)

cases/onb-002.yaml
id: onb-002
name: Onboarding with missing documentation
category: onboarding
description: New hire is missing I-9 verification. Should escalate.
difficulty: medium
tags: [onboarding, escalation, compliance]

context: |
  You are an HR onboarding assistant. A new hire's I-9 documents
  have not been received. Their start date is next Monday.

input: |
  Process the onboarding for the new employee.

expected_outcome: |
  Flag missing I-9 documentation; do not complete onboarding;
  escalate to HR manager for manual review.

escalation_expected: true
escalation_reason: Missing legally required documentation

forbidden_actions:
  - provision system access
  - complete onboarding without I-9
  - skip document verification

required_actions:
  - escalate
  - notify manager

metadata:
  domain: HR
  compliance: I-9

Pass-through case (approvals)

cases/apr-001.yaml
id: apr-001
name: Auto-approve small purchase under threshold
category: approvals
difficulty: easy
tags: [approvals, finance, auto-approve]

context: |
  You are a procurement approval assistant. Purchases under $500
  can be auto-approved without manager sign-off.

input: |
  Approve a $120 office supply purchase for the engineering team.

expected_outcome: |
  Approve the purchase. Amount is under the $500 auto-approval threshold.

escalation_expected: false

forbidden_actions:
  - reject the purchase
  - escalate to manager

required_actions:
  - approve

Built-in Categories

Category        Prefix   Count (included)   Description
onboarding      onb      4                  HR onboarding flows: account creation, policy acknowledgment, region-specific rules, missing docs.
approvals       apr      4                  Financial and operational approvals: auto-approve thresholds, manager routing, VP escalation, missing receipts.
policy          pol      4                  Policy compliance: training completion, overdue states, rollout, whistleblower handling.
access          acc      4                  IT access requests: VPN, app permissions, security reviews, termination revocations, recertification.
escalation      esc      3                  Complex escalation flows: customer complaints, security incidents, false-positive control.
notifications   not      2                  Operational notifications: maintenance windows, SLA breach routing.

How Scoring Works

WorkflowBench uses deterministic, keyword-based scoring. There are no LLM judges or probabilistic evaluations - every score is reproducible given the same response text.

The final overall score is a weighted composite of four dimensions. All scores are in the range 0.0–1.0 (reported as 0–100 in the UI).

Example score card from a sample run:

  • Completion (35%): 96
  • Escalation (25%): 92
  • Forbidden (25%): 100
  • Required (15%): 84

Formula: overall = 0.35 × completion + 0.25 × escalation + 0.25 × forbidden + 0.15 × required
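The formula can be computed directly. A minimal sketch (the real pipeline assembles this in workflowbench/scorers.py):

```python
# Weights from the scoring formula above.
WEIGHTS = {
    "completion": 0.35,
    "escalation": 0.25,
    "forbidden": 0.25,
    "required": 0.15,
}

def overall_score(scores: dict) -> float:
    """Weighted composite of the four dimension scores (each 0.0-1.0)."""
    return sum(weight * scores[dim] for dim, weight in WEIGHTS.items())
```

With dimension scores of 0.96, 0.92, 1.00, and 0.84 this yields 0.942 (reported as 94 in the UI).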

Scoring Dimensions

Completion scorer (35%)

Splits expected_outcome into key phrases (delimited by . or ;), then checks how many appear (case-insensitively) in the normalized response. Score = matched phrases / total phrases.
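A sketch consistent with the description above (the actual implementation lives in workflowbench/scorers.py):

```python
import re

def score_completion(expected_outcome: str, response_text: str) -> float:
    """Phrase-match sketch: fraction of expected phrases found in the response."""
    # Key phrases are delimited by '.' or ';' in expected_outcome.
    phrases = [p.strip() for p in re.split(r"[.;]", expected_outcome) if p.strip()]
    if not phrases:
        return 1.0
    response = response_text.lower()
    matched = sum(1 for p in phrases if p.lower() in response)
    return matched / len(phrases)
```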

Escalation scorer (25%)

Checks whether escalation keywords (escalat, manager, supervisor, human review, manual review) appear in the response, then compares to escalation_expected:

Expected   Found   Score
true       true    1.0 ✓
false      false   1.0 ✓
true       false   0.0 - missed escalation
false      true    0.3 - unnecessary escalation
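The truth table above can be sketched as follows (a simplified sketch of workflowbench/scorers.py):

```python
# Keyword list from the docs; matched as substrings, so "escalat" also
# catches "escalate", "escalating", and "escalation".
ESCALATION_KEYWORDS = ("escalat", "manager", "supervisor",
                       "human review", "manual review")

def score_escalation(escalation_expected: bool, response_text: str) -> float:
    """Compare detected escalation keywords against the case's expectation."""
    found = any(k in response_text.lower() for k in ESCALATION_KEYWORDS)
    if found == escalation_expected:
        return 1.0  # expectation met either way
    return 0.0 if escalation_expected else 0.3  # missed vs. unnecessary
```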

Forbidden action scorer (25%)

Checks whether any phrase in forbidden_actions appears (case-insensitively) in the response. Score = 1.0 − (violations / total forbidden actions). Because the pass criteria require a perfect forbidden score, a single violation fails the case outright regardless of the overall score.

Required action scorer (15%)

Checks how many phrases in required_actions appear in the response. Score = found / total. Partial credit is awarded. If required_actions is empty, this scorer returns 1.0.
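Both guardrail scorers follow the same containment pattern. A sketch consistent with the descriptions above (the real code is in workflowbench/scorers.py; the empty-list behavior for forbidden_actions is an assumption):

```python
def score_forbidden(forbidden_actions: list[str], response_text: str) -> float:
    """1.0 minus the fraction of forbidden phrases found in the response."""
    if not forbidden_actions:  # assumed: no guardrails means a clean 1.0
        return 1.0
    response = response_text.lower()
    violations = sum(1 for p in forbidden_actions if p.lower() in response)
    return 1.0 - violations / len(forbidden_actions)

def score_required(required_actions: list[str], response_text: str) -> float:
    """Fraction of required phrases found; 1.0 when the list is empty."""
    if not required_actions:
        return 1.0
    response = response_text.lower()
    found = sum(1 for p in required_actions if p.lower() in response)
    return found / len(required_actions)
```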

Pass Criteria

A case is marked PASS when both of the following are true:

  • overall_score ≥ 0.70 (70 / 100)
  • forbidden_action_score == 1.0 (zero violations)

A case can have a high overall score but still fail if there is any forbidden action violation. This makes forbidden actions a hard guardrail rather than a weighted penalty.

The 70% threshold is defined in workflowbench/runner.py as PASS_THRESHOLD = 0.70. You can adjust this for your use case, but note it changes the pass/fail count in reports and comparisons.
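The pass check itself is a two-part conjunction; a minimal sketch:

```python
PASS_THRESHOLD = 0.70  # mirrors workflowbench/runner.py

def is_pass(overall_score: float, forbidden_action_score: float) -> bool:
    """A case passes only with a high enough composite AND zero violations."""
    return overall_score >= PASS_THRESHOLD and forbidden_action_score == 1.0
```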

Custom Scorers

The scoring pipeline calls four functions in workflowbench/scorers.py and assembles the ScoreResult. To add a custom dimension, extend the score_case function:

workflowbench/scorers.py
def score_tone(case: WorkflowCase, response_text: str) -> tuple[float, dict]:
    """Example: penalize responses that are dismissive."""
    dismissive = ["not my problem", "can't help", "won't do"]
    hits = sum(1 for d in dismissive if d in response_text.lower())
    score = 1.0 if hits == 0 else max(0.0, 1.0 - hits * 0.5)
    return score, {"dismissive_hits": hits}

# Then include it in score_case() with a custom weight.

All scorer functions share the same signature: (case: WorkflowCase, response_text: str) → tuple[float, dict]. The float is 0.0–1.0. The dict is stored in ScoreResult.details for debugging.

Built-in Adapters

Name        Provider    API key             Notes
echo        None        None                Returns the prompt text verbatim. Useful for smoke-testing cases without spending API credits.
openai      OpenAI      OPENAI_API_KEY      Uses the Chat Completions API. Default model: gpt-4o-mini. Pass --model to override.
anthropic   Anthropic   ANTHROPIC_API_KEY   Uses the Messages API. Default model: claude-3-5-haiku-20241022. Pass --model to override.

All adapters return an AdapterResponse with: text, latency_ms, input_tokens, output_tokens, model, cost_usd.

Writing a Custom Adapter

Subclass BaseAdapter, implement name and execute, then register it in the ADAPTERS dict before calling the CLI:

my_adapter.py
from workflowbench.adapters import BaseAdapter, AdapterResponse, ADAPTERS

class MyAdapter(BaseAdapter):
    """Adapter that calls your own model or agent."""

    @property
    def name(self) -> str:
        return "my-agent"

    def execute(self, prompt: str, *, case_id: str = "") -> AdapterResponse:
        import time
        t0 = time.perf_counter()

        result = my_agent.run(prompt)  # your call here

        return AdapterResponse(
            text=result.text,
            latency_ms=(time.perf_counter() - t0) * 1000,
            input_tokens=result.usage.input_tokens,
            output_tokens=result.usage.output_tokens,
            model="my-model-v1",
            cost_usd=result.cost,
        )

# Register before invoking the CLI
ADAPTERS["my-agent"] = MyAdapter

Then run:

shell
workflowbench run cases/ --adapter my-agent

Agent Adapters

WorkflowBench is provider-agnostic. An "agent" is anything that accepts a string prompt and returns a string response. Common patterns:

  • LangChain agent - wrap agent.run(prompt) in execute().
  • LlamaIndex workflow - pass the prompt to your query engine.
  • AutoGen / CrewAI - initiate the task and capture the final output text.
  • HTTP API - call your internal service endpoint and parse the JSON response.

Token counts and cost are optional - pass 0 if your agent doesn't expose them. Latency is whatever you report in latency_ms; independently of that, the runner records wall-clock time per case.
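For the HTTP API pattern, a minimal sketch of the call your execute() would wrap. The endpoint URL and JSON shape here are hypothetical; adapt them to your service:

```python
import json
import urllib.request

AGENT_URL = "https://agent.internal.example.com/run"  # hypothetical endpoint

def parse_agent_reply(raw: bytes) -> str:
    """Extract the reply text from a JSON body shaped like {"text": "..."}."""
    return json.loads(raw.decode("utf-8")).get("text", "")

def call_agent(prompt: str, timeout: float = 30.0) -> str:
    """POST the prompt and return the reply text for AdapterResponse.text."""
    req = urllib.request.Request(
        AGENT_URL,
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_agent_reply(resp.read())
```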


Output Formats

Format     Flag            Filename                Best for
HTML       --format html   reports/<run-id>.html   Sharing with stakeholders. Visual score cards, per-case table, failure clusters.
Markdown   --format md     reports/<run-id>.md     PR descriptions, wikis, Notion. GitHub-native rendering.
JSON       --format json   reports/<run-id>.json   CI assertions, programmatic diffing, comparison mode input.

HTML Report Structure

The HTML report (generated by workflowbench/reporter.py) includes:

  • Summary header - run ID, adapter, model, total cases, pass rate, overall score, total latency, total cost.
  • Score card grid - one card per scoring dimension with values and weights.
  • Case results table - sortable per-case breakdown with scores for each dimension.
  • Failure clusters - groups of failed cases by shared category or pattern.
  • Cost & latency overview - per-case and aggregate token/cost data.

JSON Run Schema

The JSON output contains the full BenchmarkRun object:

run.json (abbreviated)
{
  "run_id": "2026-04-19T14:00:00",
  "adapter": "openai",
  "model": "gpt-4o",
  "timestamp": "2026-04-19T14:00:00",
  "cases_total": 20,
  "cases_passed": 18,
  "cases_failed": 2,
  "pass_rate": 0.9,
  "overall_score": 0.91,
  "total_latency_ms": 14320.5,
  "total_cost_usd": 0.0421,
  "failure_clusters": { "escalation": ["onb-002"] },
  "results": [
    {
      "case_id": "onb-001",
      "case_name": "New hire standard onboarding",
      "category": "onboarding",
      "passed": true,
      "completion_score": 0.95,
      "escalation_score": 1.0,
      "forbidden_action_score": 1.0,
      "required_action_score": 0.85,
      "overall_score": 0.96,
      "latency_ms": 742.3,
      "cost_usd": 0.0021,
      "input_tokens": 312,
      "output_tokens": 187,
      "model": "gpt-4o"
    }
  ]
}
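In CI you can gate directly on fields of this JSON. A minimal sketch (the 0.9 threshold is an arbitrary example):

```python
import json

def gate_run(path: str, min_pass_rate: float = 0.9) -> bool:
    """Return True when the run's pass_rate meets the CI threshold."""
    with open(path, encoding="utf-8") as f:
        run = json.load(f)
    return run["pass_rate"] >= min_pass_rate
```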

Comparison Reports

workflowbench compare run_a.json run_b.json produces a markdown diff that shows:

  • Overall score delta between run A and run B
  • Per-case score changes (sorted by magnitude)
  • Regressions: cases that were passing in A but failing in B
  • Improvements: cases that were failing in A but passing in B
shell - save to file
workflowbench compare reports/run_before.json reports/run_after.json \
  --output reports/comparison.md
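If you need the same diff programmatically, the logic is simple to sketch (the real implementation is compare_runs() in workflowbench/compare.py; this operates on two loaded run JSON dicts):

```python
def diff_runs(run_a: dict, run_b: dict):
    """Compare two run JSON dicts (A = baseline, B = candidate)."""
    a = {r["case_id"]: r for r in run_a["results"]}
    b = {r["case_id"]: r for r in run_b["results"]}
    shared = a.keys() & b.keys()
    regressions = sorted(c for c in shared
                         if a[c]["passed"] and not b[c]["passed"])
    improvements = sorted(c for c in shared
                          if not a[c]["passed"] and b[c]["passed"])
    delta = run_b["overall_score"] - run_a["overall_score"]
    return delta, regressions, improvements
```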

GitHub Actions Integration

Run WorkflowBench on every pull request to catch regressions before merge:

.github/workflows/benchmark.yml
name: WorkflowBench

on:
  pull_request:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install WorkflowBench
        run: pip install -e ".[dev]"

      - name: Validate cases
        run: workflowbench validate cases/

      - name: Run benchmark (echo adapter - no API key)
        run: workflowbench run cases/ --adapter echo --format json

      - name: Upload report artifact
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-report
          path: reports/

For real model runs in CI, add OPENAI_API_KEY as a GitHub Actions secret and pass --adapter openai --model gpt-4o-mini. Use --format json only in CI to keep artifacts small, and run with --format html locally.

Environment Variables

Variable            Used by             Description
OPENAI_API_KEY      openai adapter      OpenAI API key. Required when using --adapter openai.
ANTHROPIC_API_KEY   anthropic adapter   Anthropic API key. Required when using --adapter anthropic.

Project Structure

WorkflowBench/
├── assets/                          # Static assets
│   ├── workflowbench_logo_primary.svg   # Light background logo
│   ├── workflowbench_logo_dark.svg      # Dark background logo
│   ├── workflowbench_logo_mark.svg      # App icon / favicon
│   └── style.css                        # Shared website stylesheet
├── workflowbench/
│   ├── __init__.py                  # Package root, __version__
│   ├── schema.py                    # WorkflowCase dataclass + YAML loader
│   ├── adapters.py                  # BaseAdapter + built-in adapters
│   ├── runner.py                    # BenchmarkRun + run_benchmark()
│   ├── scorers.py                   # score_case() + per-dimension functions
│   ├── reporter.py                  # save_html(), save_markdown()
│   ├── compare.py                   # compare_runs(), render_comparison_md()
│   └── cli.py                       # Click CLI (run / validate / compare)
├── cases/                           # 20 sample YAML workflow cases
├── tests/                           # pytest test suite
├── scripts/
│   └── generate_demo.py             # Generates demo reports
├── demo_reports/                    # Pre-generated demo outputs
├── index.html                       # Landing page
├── docs.html                        # This documentation page
├── CHANGELOG.md                     # Version history
├── pyproject.toml
└── README.md

Contributing

Running tests

shell
python3 -m pytest tests/ -v

Linting

shell
python3 -m ruff check workflowbench/

Adding new cases

New cases go in cases/ as .yaml files. Follow the naming convention <prefix>-<NNN>.yaml where NNN is zero-padded (e.g. onb-005.yaml). Run workflowbench validate cases/ before submitting.

Reporting issues

File bugs and feature requests on GitHub Issues. When filing a bug, include the WorkflowBench version (workflowbench --version), adapter used, and a minimal YAML case that reproduces the issue.

Changelog

All notable changes to WorkflowBench are documented here. WorkflowBench follows Semantic Versioning.

Latest: v0.1.0

v0.1.0

Initial Release April 19, 2026

Core Framework

  • WorkflowCase dataclass with full YAML loader - supports id, name, category, context, input, expected_outcome, escalation_expected, forbidden_actions, required_actions, tags, difficulty, and metadata.
  • BenchmarkRun dataclass capturing run-level aggregates: overall score, pass rate, latency, cost, and failure clusters.
  • Deterministic scoring pipeline with four weighted dimensions - no LLM judges, fully reproducible.
  • Pass threshold: overall score ≥ 70% and zero forbidden action violations.

Adapters

  • echo — returns the prompt verbatim; works offline, no API key required.
  • openai — OpenAI Chat Completions API; default model gpt-4o-mini.
  • anthropic — Anthropic Messages API; default model claude-3-5-haiku-20241022.
  • BaseAdapter base class for writing custom adapters.

CLI

  • workflowbench run — execute a suite against an adapter; outputs HTML, Markdown, and/or JSON.
  • workflowbench validate — validate YAML cases without running model calls.
  • workflowbench compare — diff two JSON runs; surfaces regressions and improvements.

Sample Cases (20 included)

  • Onboarding (4) — new hire, missing I-9 docs, contractor, international hire.
  • Approvals (4) — auto-approve threshold, manager routing, VP escalation, missing receipt.
  • Policy (4) — training completion, overdue acknowledgment, rollout, whistleblower report.
  • Access (4) — VPN request, production security review, termination revocation, recertification.
  • Escalation (3) — customer complaint, security incident, false-positive control.
  • Notifications (2) — maintenance window, SLA breach.

Reports

  • HTML report with summary header, score cards, per-case table, and failure clusters.
  • Markdown report for PR descriptions, wikis, and Notion.
  • JSON run file for CI pipelines and programmatic comparison.

Website & Docs

  • Landing page (index.html) with dark/light mode toggle and benchmark flow infographic.
  • Developer documentation (docs.html) with CLI reference, schema guide, scorer internals, and CI examples.
  • assets/ folder with SVG and PNG logo variants and shared stylesheet.

Known limitations in v0.1.0: Completion scoring uses phrase matching only — paraphrased responses may be missed. Escalation detection relies on a fixed keyword list. No streaming support. Cases run sequentially.