Evelyn Hart

food-hospitality-culinary-arts-food-critic-characters-m-f-k-fisher v2.0 Ethical
Backstory: Evelyn Hart is a restaurant critic with degrees in environmental science and journalism. She evaluates dining spots through carbon footprint analyses, supply-chain ethics, and seasonal sourcing metrics. Her investigative reports aim to nudge the industry toward transparent, low-impact practices.
100% Complete
6/6 scenes
Model Performance Overview
Scene Performance Matrix
Scene | Title | meta-llama/llama-3.… | mistralai/mistral-7… | [email protected] | [email protected] | qwen/qwen-2.5-7b-in… | qwen/qwen3-14b | qwen/qwen3-8b
intro-bistro-review | First take on Bloom & Root | 0.375 | 0.676 | 0.000 (Error) | 0.000 (Error) | 0.387 | 0.720 | 0.598
sourcing-proof-request | Verifying tuna sourcing | 0.506 | 0.488 | 0.000 (Error) | 0.000 (Error) | 0.520 | 0.557 | 0.481
long-form-urban-greens-expose | Investigative exposé on UrbanGreens | 0.457 | 0.614 | 0.000 (Error) | 0.000 (Error) | 0.412 | 0.422 | 0.524
seasonal-menu-advice | Early-spring produce tips | 0.618 | 0.673 | 0.000 (Error) | 0.000 (Error) | 0.509 | 0.635 | 0.652
superchat-zero-waste-bakeries | Shout-out request via superchat | 0.000 | 0.702 | 0.000 (Error) | 0.000 (Error) | 0.547 | 0.669 | 0.666
long-form-rating-criteria | Publishing rating framework | 0.000 | 0.459 | 0.000 (Error) | 0.000 (Error) | 0.507 | 0.446 | 0.550
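The matrix above can be summarized per model. The following is a minimal sketch, assuming a plain arithmetic mean over the six scenes; the dashboard's own aggregation is not stated on the page, and the function names `mean_scores` and `rank_models` are hypothetical. The two obfuscated columns are left out because every cell in them is 0.000 with an Error flag.

```python
from statistics import mean

# Scores copied from the Scene Performance Matrix, in scene order.
# Model names are kept exactly as truncated on the page.
SCORES = {
    "meta-llama/llama-3.…": [0.375, 0.506, 0.457, 0.618, 0.000, 0.000],
    "mistralai/mistral-7…": [0.676, 0.488, 0.614, 0.673, 0.702, 0.459],
    "qwen/qwen-2.5-7b-in…": [0.387, 0.520, 0.412, 0.509, 0.547, 0.507],
    "qwen/qwen3-14b": [0.720, 0.557, 0.422, 0.635, 0.669, 0.446],
    "qwen/qwen3-8b": [0.598, 0.481, 0.524, 0.652, 0.666, 0.550],
}


def mean_scores(scores):
    """Return the arithmetic mean score per model (an assumed aggregation)."""
    return {model: mean(values) for model, values in scores.items()}


def rank_models(scores):
    """Return model names sorted from highest to lowest mean score."""
    means = mean_scores(scores)
    return sorted(means, key=means.get, reverse=True)


if __name__ == "__main__":
    means = mean_scores(SCORES)
    for model in rank_models(SCORES):
        print(f"{model}: {means[model]:.3f}")
```

Note that a 0.000 without an Error flag (the last two llama rows) is treated here as a real score; whether it reflects a failed run is not clear from the page.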
Test Scenes (6)
Scene Order: 0
First take on Bloom & Root
ID: intro-bistro-review
🎯 Goal:
Deliver a concise, data-driven sustainability critique of Bloom & Root, citing carbon impact, sourcing, and seasonal menu choices.
📨 Input Events:
chat_msg viewer:foodie42
"What did you think of the new bistro Bloom & Root?"
Ready for Testing
Scene Order: 1
Verifying tuna sourcing
ID: sourcing-proof-request
🎯 Goal:
Provide an evidence-based answer about OceanFresh Sushi’s tuna sourcing, including at least one credible reference.
📨 Input Events:
chat_msg viewer:sushiFan
"Can you verify if OceanFresh Sushi sources line-caught tuna?"
Ready for Testing
Scene Order: 2
Investigative exposé on UrbanGreens
ID: long-form-urban-greens-expose
🎯 Goal:
Publish a 1,000-word investigative article exposing the truth behind UrbanGreens’ farm-to-table claims, using structured sections and data.
🧠 Initial State:
Pre-loaded Memories:
  • 💭 {'kind': 'promise', 'content': 'Deliver UrbanGreens exposé with verified carbon data by tomorrow noon.', 'importance': 4}
📨 Input Events:
chat_msg editor:magazine
"Deadline tomorrow: 1,200-word exposé on Farm-to-Table claims at UrbanGreens."
Ready for Testing
Scene Order: 3
Early-spring produce tips
ID: seasonal-menu-advice
🎯 Goal:
Recommend three early-spring ingredients that lower a restaurant’s footprint, with brief reasoning for each.
📨 Input Events:
chat_msg viewer:chefAnna
"What produce should restaurants focus on in early spring?"
Ready for Testing
Scene Order: 4
Shout-out request via superchat
ID: superchat-zero-waste-bakeries
🎯 Goal:
Thank the donor and list two Chicago bakeries practicing zero-waste operations.
📨 Input Events:
superchat viewer:greenGourmand YouTube $20
"Love your work; can you shout out any zero-waste bakeries in Chicago?"
Ready for Testing
Scene Order: 5
Publishing rating framework
ID: long-form-rating-criteria
🎯 Goal:
Produce a detailed 800-word document explaining Evelyn’s sustainability rating criteria, including scoring weights and examples.
📨 Input Events:
chat_msg viewer:researcherKim
"Can you publish your rating criteria for sustainability?"
Ready for Testing
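Each scene above has the same shape: an ID, a title, a goal, one or more input events, and optionally pre-loaded memories. As a rough illustration, that shape could be modeled as below. The class and field names (`Scene`, `InputEvent`, `has_promise`) are assumptions made for this sketch and are not taken from the test harness itself.

```python
from dataclasses import dataclass, field


@dataclass
class InputEvent:
    kind: str    # event type as shown on the page, e.g. "chat_msg" or "superchat"
    sender: str  # role and name, e.g. "editor:magazine"
    text: str    # the message body


@dataclass
class Scene:
    scene_id: str
    title: str
    goal: str
    events: list[InputEvent]
    # Pre-loaded memories, kept as dicts in the same form the page displays them.
    memories: list[dict] = field(default_factory=list)

    def has_promise(self) -> bool:
        """True if any pre-loaded memory is of kind 'promise'."""
        return any(m.get("kind") == "promise" for m in self.memories)


# The UrbanGreens scene, transcribed from the listing above.
expose = Scene(
    scene_id="long-form-urban-greens-expose",
    title="Investigative exposé on UrbanGreens",
    goal=(
        "Publish a 1,000-word investigative article exposing the truth behind "
        "UrbanGreens’ farm-to-table claims, using structured sections and data."
    ),
    events=[
        InputEvent(
            kind="chat_msg",
            sender="editor:magazine",
            text="Deadline tomorrow: 1,200-word exposé on Farm-to-Table claims at UrbanGreens.",
        )
    ],
    memories=[
        {
            "kind": "promise",
            "content": "Deliver UrbanGreens exposé with verified carbon data by tomorrow noon.",
            "importance": 4,
        }
    ],
)
```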
Latency by Model (This Suite)
Fastest
Model | Latency | p95 | avg | N
[email protected]/Qw… | 14261 ms | 17529 ms | 13637 ms | 6
qwen/qwen3-14b | 19236 ms | 34975 ms | 23077 ms | 12
qwen/qwen-2.5-7b-instru… | 21527 ms | 28218 ms | 22392 ms | 12
meta-llama/llama-3.1-8b… | 21998 ms | 32956 ms | 22506 ms | 11
mistralai/mistral-7b-in… | 23598 ms | 31414 ms | 24211 ms | 12
Slowest
Model | Latency | p95 | avg | N
[email protected]/Qw… | 40514 ms | 234039 ms | 103879 ms | 6
qwen/qwen3-8b | 27144 ms | 34360 ms | 27712 ms | 12
mistralai/mistral-7b-in… | 23598 ms | 31414 ms | 24211 ms | 12
meta-llama/llama-3.1-8b… | 21998 ms | 32956 ms | 22506 ms | 11
qwen/qwen-2.5-7b-instru… | 21527 ms | 28218 ms | 22392 ms | 12
Per-scene duration for this suite.
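The p95, avg, and N figures above can be derived from raw per-scene durations. The sketch below assumes the nearest-rank percentile method; the dashboard does not say which method it uses, and the sample durations are invented for illustration rather than taken from any run. The function name `summarize` is hypothetical.

```python
import math


def summarize(durations_ms):
    """Return p95 (nearest-rank), average, and sample count for durations in ms."""
    if not durations_ms:
        raise ValueError("need at least one duration")
    ordered = sorted(durations_ms)
    n = len(ordered)
    # Nearest-rank: the smallest value with at least 95% of samples at or below it.
    rank = math.ceil(0.95 * n)  # 1-based rank
    return {"p95": ordered[rank - 1], "avg": sum(ordered) / n, "n": n}


# Made-up durations for six scenes, in milliseconds.
sample = [12000, 13000, 14000, 15000, 16000, 20000]

if __name__ == "__main__":
    print(summarize(sample))
```

With only 6 or 12 samples per model, nearest-rank p95 is simply the slowest observation, so one slow scene dominates the figure.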
Evaluation Schema
Enhanced Framework
Version v2 ACTIVE
0 dimensions

Enhanced evaluation framework with character and technical dimensions

Top Weighted Dimensions
Character Authenticity: 0.182
Plan Validity: 0.155
Contextual Intelligence: 0.136
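One way such weights might combine into a single scene score is a weighted average. This is a sketch under stated assumptions: only the three top weights are shown on the page, so the sum is renormalized over the dimensions actually scored, and the per-dimension scores in `example` are invented. The name `weighted_score` is hypothetical, and the framework's real formula may differ.

```python
# The three weights listed under Top Weighted Dimensions; other dimensions
# and their weights are not shown on the page.
TOP_WEIGHTS = {
    "Character Authenticity": 0.182,
    "Plan Validity": 0.155,
    "Contextual Intelligence": 0.136,
}


def weighted_score(dimension_scores, weights=TOP_WEIGHTS):
    """Weighted average over the dimensions that have both a score and a weight."""
    used = {dim: w for dim, w in weights.items() if dim in dimension_scores}
    total = sum(used.values())
    if total == 0:
        raise ValueError("no weighted dimensions were scored")
    return sum(dimension_scores[dim] * w for dim, w in used.items()) / total


# Invented per-dimension scores on a 0-1 scale, for illustration only.
example = {
    "Character Authenticity": 0.8,
    "Plan Validity": 0.6,
    "Contextual Intelligence": 0.5,
}

if __name__ == "__main__":
    print(f"{weighted_score(example):.3f}")
```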
Recent Runs
38714966 | Dec. 17, 2025, 12:01 a.m.
54369628 | Dec. 16, 2025, 12:01 a.m.
34111778 | Dec. 15, 2025, 12:01 a.m.
35678071 | Dec. 14, 2025, 12:01 a.m.
34640546 | Dec. 13, 2025, 12:01 a.m.
47675820 | Dec. 12, 2025, 12:01 a.m.
43906342 | Dec. 11, 2025, 12:01 a.m.
36266148 | Dec. 10, 2025, 12:01 a.m.
50091782 | Dec. 9, 2025, 12:01 a.m.
38532662 | Dec. 8, 2025, 12:01 a.m.
Latency Overview (This Suite)