Open-source benchmark harness

Test AI workflows
before they break
in production.

Define realistic YAML scenarios. Run prompts or agents. Score correctness, escalation, compliance, latency, and cost - all before rollout.

workflowbench
$ workflowbench run cases/ --adapter openai
onboarding/new_hire PASS
approvals/finance_capex PASS
access/vpn_request ESCALATE
policy/ack_overdue PASS
notifications/payroll PASS
━━━ Score: 91/100 · 5 cases · 0.42s
Why teams need this

Demo success is not
launch confidence.

WorkflowBench is built for teams shipping AI into business processes where edge cases, escalation paths, and policy compliance matter.

Demos hide failure modes

Happy-path demos look impressive. Proving correct behavior across real edge cases is much harder.

No common scorecard

Engineering managers and architects need a shared benchmark output to judge readiness and compare versions.

Reliability should be reusable

Open-source cases, deterministic scorers, and local-first reports make validation accessible to any team.

How it works

Four steps from YAML
to launch confidence.

Real scenarios go in. Deterministic scores come out. Leaders get a report they can trust.

1. Define YAML scenarios

Onboarding · Approvals · Policy · Access · Notifications
cases/onb-001.yaml
id: onb-001
name: New hire onboarding
category: onboarding
input:
  department: Finance
  region: EU
rules:
  - manager_approval
  - policy_ack_required
expected:
  outcome: account_created
  escalate: false
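The same schema can express escalation cases. Below is a hypothetical sketch of the access/vpn_request case from the sample run, assuming the harness treats `escalate: true` as the expected outcome; the rule names and the `escalated_to_manager` outcome value are illustrative, not documented fields:

```yaml
# Hypothetical scenario sketch: field names mirror the onboarding
# example above; rule names and outcome values are illustrative.
id: acc-001
name: VPN access request
category: access
input:
  department: Engineering
  region: US
rules:
  - manager_approval
expected:
  outcome: escalated_to_manager
  escalate: true
```

A case like this is how the sample run produces an ESCALATE result instead of a PASS/FAIL: the scorer checks that the system routed to a human rather than granting access directly.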
2. Run against any provider

OpenAI · Anthropic · Echo · Custom Agent
$ workflowbench run cases/ --adapter openai
onboarding/new_hire PASS
approvals/finance_capex PASS
access/vpn_request ESCALATE
policy/ack_overdue PASS
notifications/payroll PASS
3. Deterministic scoring

Correctness · Escalation · Compliance · Latency · Cost
4. Shareable reports

HTML · Markdown · JSON
Overall Score: 91 / 100
Failure Clusters: Escalation · Latency · Compliance
Compare Runs: prompt-v1 scored 76, prompt-v2 scored 91
Production-style coverage

Enterprise workflows that
feel real, not synthetic.

Common business processes where correctness and escalation behavior matter more than chatbot polish.

Onboarding

Identity verification, account creation, approval routing, region-specific policy steps.

Approvals

Financial thresholds, manager routing, escalation logic for ambiguous approvals.

Policy

Compliance requirements, reminder cadence, overdue handling, forbidden action checks.

Access

VPN & app access grants, risky request denial, manager approval gating.

Notifications

Stakeholder routing, timing constraints, message correctness across channels.

Signals that matter

Scoring for engineers
and managers.

Deterministic evaluation so benchmark results stay reproducible and actionable.

Correctness 96%
Escalation 92%
Compliance 100%
Latency 84%
Cost 68%
What ships

Local-first, small-team friendly, easy to extend.

YAML Scenarios

Readable format with expected outcomes, escalation rules, and forbidden actions.
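Forbidden actions could be expressed as an explicit list the scorer checks the system's behavior against. A hypothetical sketch, assuming a `forbidden` key (the field name and action identifiers are assumptions, not documented parts of the format):

```yaml
# Hypothetical sketch: the forbidden key and action names are
# assumed for illustration; only id/name/category/expected mirror
# the documented scenario fields.
id: pol-002
name: Overdue policy acknowledgement
category: policy
expected:
  outcome: reminder_sent
  escalate: false
forbidden:
  - auto_acknowledge_on_behalf_of_user
  - delete_policy_record
```

Listing disallowed actions alongside expected outcomes keeps the pass/fail judgment deterministic: a case fails if the system takes any forbidden action, even when the final outcome looks correct.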

One CLI Command

Benchmark a full suite locally and generate a report in a single run.

Provider-Agnostic

Swap prompts, providers, or agent entrypoints without rewriting tests.

Shareable Reports

HTML, Markdown, JSON - engineers get detail, leaders get the story.

Open source

A reliability signal for teams
building real AI systems.

A lightweight benchmark harness, not a full eval platform: prove quality before rollout and compare changes over time.

01. Engineering depth, not just prompting

Systems thinking, launch quality, and operational judgment - beyond a simple chatbot demo.

02. Makes launch risk discussable

Reports expose failure clusters, comparison runs, and workflow regressions in a manager-friendly format.

03. No hosted platform needed

Define cases locally, run on your own adapters, produce reports that are easy to share.

Stop shipping AI workflows
on vibes.

WorkflowBench runs locally, needs no hosted service, and produces reports you can share with your entire team.