Open-source benchmark harness

Test AI workflows
before they break
in production.

Define realistic YAML scenarios. Run prompts or agents. Score correctness, escalation, compliance, latency, and cost - all before rollout.

workflowbench
$ workflowbench run cases/ --adapter openai
onboarding/new_hire PASS
approvals/finance_capex PASS
access/vpn_request ESCALATE
policy/ack_overdue PASS
notifications/payroll PASS
━━━ Score: 91/100 · 5 cases · 0.42s
Why teams need this

Demo success is not
launch confidence.

WorkflowBench is built for teams shipping AI into business processes where edge cases, escalation paths, and policy compliance matter.

Demos hide failure modes

Happy-path demos look impressive. Proving correct behavior across real edge cases is much harder.

No common scorecard

Engineering managers and architects need a shared benchmark output to judge readiness and compare versions.

Reliability should be reusable

Open-source cases, deterministic scorers, and local-first reports make validation accessible to any team.

How it works

Four steps from YAML
to launch confidence.

Real scenarios go in. Deterministic scores come out. Leaders get a report they can trust.

1. Define YAML scenarios

Onboarding · Approvals · Policy · Access · Notifications
cases/onb-001.yaml
id: onb-001
name: New hire onboarding
category: onboarding
input:
  department: Finance
  region: EU
rules:
  - manager_approval
  - policy_ack_required
expected:
  outcome: account_created
  escalate: false
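The same schema can express escalation cases. Below is a hypothetical sketch of the access/vpn_request case from the sample run, assuming the harness treats `escalate: true` as the expected outcome; the rule names and the `escalated_to_manager` outcome value are illustrative, not documented fields:

```yaml
# Hypothetical scenario sketch: field names mirror the onboarding
# example above; rule names and outcome values are illustrative.
id: acc-001
name: VPN access request
category: access
input:
  department: Engineering
  region: US
rules:
  - manager_approval
expected:
  outcome: escalated_to_manager
  escalate: true
```

A case like this is how the sample run produces an ESCALATE result instead of a PASS/FAIL: the scorer checks that the system routed to a human rather than granting access directly.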
2. Run against any provider

OpenAI · Anthropic · Echo · Custom Agent
$ workflowbench run cases/ --adapter openai
onboarding/new_hire PASS
approvals/finance_capex PASS
access/vpn_request ESCALATE
policy/ack_overdue PASS
notifications/payroll PASS
3. Deterministic scoring

Correctness · Escalation · Compliance · Latency · Cost
4. Shareable reports

HTML · Markdown · JSON
Overall Score: 91 / 100
Failure Clusters: Escalation · Latency · Compliance
Compare Runs: prompt-v1 scored 76, prompt-v2 scored 91
Production-style coverage

Enterprise workflows that
feel real, not synthetic.

Common business processes where correctness and escalation behavior matter more than chatbot polish.

Onboarding

Identity verification, account creation, approval routing, region-specific policy steps.

Approvals

Financial thresholds, manager routing, escalation logic for ambiguous approvals.

Policy

Compliance requirements, reminder cadence, overdue handling, forbidden action checks.

Access

VPN & app access grants, risky request denial, manager approval gating.

Notifications

Stakeholder routing, timing constraints, message correctness across channels.

Signals that matter

Scoring for engineers
and managers.

Deterministic evaluation so benchmark results stay reproducible and actionable.

Correctness 96%
Escalation 92%
Compliance 100%
Latency 84%
Cost 68%
What ships

Local-first, small-team friendly, easy to extend.

YAML Scenarios

Readable format with expected outcomes, escalation rules, and forbidden actions.
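Forbidden actions could be expressed as an explicit list the scorer checks the system's behavior against. A hypothetical sketch, assuming a `forbidden` key (the field name and action identifiers are assumptions, not documented parts of the format):

```yaml
# Hypothetical sketch: the forbidden key and action names are
# assumed for illustration; only id/name/category/expected mirror
# the documented scenario fields.
id: pol-002
name: Overdue policy acknowledgement
category: policy
expected:
  outcome: reminder_sent
  escalate: false
forbidden:
  - auto_acknowledge_on_behalf_of_user
  - delete_policy_record
```

Listing disallowed actions alongside expected outcomes keeps the pass/fail judgment deterministic: a case fails if the system takes any forbidden action, even when the final outcome looks correct.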

One CLI Command

Benchmark a full suite locally and generate a report in a single run.

Provider-Agnostic

Swap prompts, providers, or agent entrypoints without rewriting tests.

Shareable Reports

HTML, Markdown, JSON - engineers get detail, leaders get the story.

Open source

A reliability signal for teams
building real AI systems.

A lightweight benchmark harness, not a full eval platform: prove quality before rollout and compare changes over time.

01. Engineering depth, not just prompting

Systems thinking, launch quality, and operational judgment - beyond a simple chatbot demo.

02. Makes launch risk discussable

Reports expose failure clusters, comparison runs, and workflow regressions in a manager-friendly format.

03. No hosted platform needed

Define cases locally, run on your own adapters, produce reports that are easy to share.

Stop shipping AI workflows
on vibes.

WorkflowBench runs locally, needs no hosted service, and produces reports you can share with your entire team.