Demos hide failure modes
Happy-path demos look impressive. Proving correct behavior across real edge cases is much harder.
Define realistic YAML scenarios. Run prompts or agents. Score correctness, escalation, compliance, latency, and cost - all before rollout.
WorkflowBench is built for teams shipping AI into business processes where edge cases, escalation paths, and policy compliance matter.
Engineers, managers, and architects all read the same benchmark output, so everyone judges readiness and compares versions from one source of truth.
Open-source cases, deterministic scorers, and local-first reports make validation accessible to any team.
Real scenarios go in. Deterministic scores come out. Leaders get a report they can trust.
id: onb-001
name: New hire onboarding
category: onboarding
input:
  department: Finance
  region: EU
  rules:
    - manager_approval
    - policy_ack_required
expected:
  outcome: account_created
  escalate: false
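A case like this can be scored deterministically: the agent's result is compared field by field against the `expected` block, so the same inputs always produce the same score. The sketch below is illustrative; `score_case` and the case/result shapes are assumptions, not WorkflowBench's actual API.

```python
# Minimal sketch of deterministic case scoring.
# Case and result shapes are illustrative, not WorkflowBench's API.

def score_case(case: dict, result: dict) -> dict:
    """Compare an agent run against the case's expected block."""
    expected = case["expected"]
    checks = {
        "outcome": result.get("outcome") == expected["outcome"],
        "escalate": result.get("escalate", False) == expected["escalate"],
    }
    return {"case_id": case["id"], "passed": all(checks.values()), "checks": checks}

case = {
    "id": "onb-001",
    "expected": {"outcome": "account_created", "escalate": False},
}
run = {"outcome": "account_created", "escalate": False}
print(score_case(case, run)["passed"])  # identical inputs always score identically
```

Because there is no model in the scoring loop, re-running a suite on the same outputs can never flip a verdict.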
Common business processes where correctness and escalation behavior matter more than chatbot polish.
Identity verification, account creation, approval routing, region-specific policy steps.
Financial thresholds, manager routing, escalation logic for ambiguous approvals.
Compliance requirements, reminder cadence, overdue handling, forbidden action checks.
VPN & app access grants, risky request denial, manager approval gating.
Stakeholder routing, timing constraints, message correctness across channels.
Deterministic evaluation so benchmark results stay reproducible and actionable.
Readable format with expected outcomes, escalation rules, and forbidden actions.
Benchmark a full suite locally and generate a report in a single run.
Swap prompts, providers, or agent entrypoints without rewriting tests.
HTML, Markdown, JSON - engineers get detail, leaders get the story.
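The adapter seam described above can be sketched as a small interface: prompts, providers, or agent entrypoints plug in behind one method, and the cases never change. All names here (`Adapter`, `run_suite`, the stand-in adapter) are illustrative assumptions, not WorkflowBench's real classes.

```python
# Sketch of swapping providers behind one adapter interface.
# Names are illustrative assumptions, not WorkflowBench's real API.
from typing import Protocol


class Adapter(Protocol):
    def run(self, case_input: dict) -> dict: ...


class StubAdapter:
    """Stand-in adapter; a real one would call a model or agent."""

    def run(self, case_input: dict) -> dict:
        return {"outcome": "account_created", "escalate": False}


def run_suite(adapter: Adapter, cases: list[dict]) -> list[dict]:
    """Run every case through the given adapter; cases stay untouched."""
    return [{"id": c["id"], "result": adapter.run(c["input"])} for c in cases]


cases = [{"id": "onb-001", "input": {"department": "Finance", "region": "EU"}}]
print(run_suite(StubAdapter(), cases))
```

Swapping a provider then means writing one new class with a `run` method; the YAML cases, scorers, and reports are unaffected.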
A lightweight benchmark harness - not a full eval platform. Prove quality before rollout, compare changes over time.
It rewards systems thinking, launch quality, and operational judgment - qualities a simple chatbot demo can't show.
Reports expose failure clusters, comparison runs, and workflow regressions in a manager-friendly format.
Define cases locally, run on your own adapters, produce reports that are easy to share.
WorkflowBench runs locally, needs no hosted service, and produces reports you can share with your entire team.