Evelyn Hart
food-hospitality-culinary-arts-food-critic-characters-m-f-k-fisher
v2.0
Ethical
Backstory: Evelyn Hart is a restaurant critic with degrees in environmental science and journalism. She evaluates dining spots through carbon footprint analyses, supply-chain ethics, and seasonal sourcing metrics. Her investigative reports aim to nudge the industry toward transparent, low-impact practices.
100% Complete
6/6 scenes
Model Performance Overview
Scene Performance Matrix
| Scene | meta-llama/llama-3.… | mistralai/mistral-7… | [email protected]… | [email protected]… | qwen/qwen-2.5-7b-in… | qwen/qwen3-14b | qwen/qwen3-8b |
|---|---|---|---|---|---|---|---|
| intro-bistro-review (First take on Bloom & Root) | 0.375 | 0.676 | 0.000 (Error) | 0.000 (Error) | 0.387 | 0.720 | 0.598 |
| sourcing-proof-request (Verifying tuna sourcing) | 0.506 | 0.488 | 0.000 (Error) | 0.000 (Error) | 0.520 | 0.557 | 0.481 |
| long-form-urban-greens-expose (Investigative exposé on UrbanGreens) | 0.457 | 0.614 | 0.000 (Error) | 0.000 (Error) | 0.412 | 0.422 | 0.524 |
| seasonal-menu-advice (Early-spring produce tips) | 0.618 | 0.673 | 0.000 (Error) | 0.000 (Error) | 0.509 | 0.635 | 0.652 |
| superchat-zero-waste-bakeries (Shout-out request via superchat) | 0.000 | 0.702 | 0.000 (Error) | 0.000 (Error) | 0.547 | 0.669 | 0.666 |
| long-form-rating-criteria (Publishing rating framework) | 0.000 | 0.459 | 0.000 (Error) | 0.000 (Error) | 0.507 | 0.446 | 0.550 |
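To compare models across the suite, the matrix can be reduced to one mean score per model. The sketch below is an illustration, not the dashboard's own aggregation: it assumes cells flagged "Error" should be excluded rather than counted as zero, and that a plain 0.000 without an error flag is a real score. The names `scores` and `mean_score` and the keys for the two errored columns are hypothetical.

```python
from statistics import mean

# Scores transcribed from the matrix above, in scene order.
# None marks a cell the dashboard flags as "Error" (assumption: excluded
# from the mean); a plain 0.000 is kept as a genuine score.
# The two errored columns share the same truncated label on the page, so
# the keys used for them here are hypothetical placeholders.
scores = {
    "meta-llama/llama-3.…": [0.375, 0.506, 0.457, 0.618, 0.000, 0.000],
    "mistralai/mistral-7…": [0.676, 0.488, 0.614, 0.673, 0.702, 0.459],
    "errored column 1": [None] * 6,
    "errored column 2": [None] * 6,
    "qwen/qwen-2.5-7b-in…": [0.387, 0.520, 0.412, 0.509, 0.547, 0.507],
    "qwen/qwen3-14b": [0.720, 0.557, 0.422, 0.635, 0.669, 0.446],
    "qwen/qwen3-8b": [0.598, 0.481, 0.524, 0.652, 0.666, 0.550],
}


def mean_score(cells):
    """Average the valid cells; return None when every cell errored."""
    valid = [c for c in cells if c is not None]
    return mean(valid) if valid else None


if __name__ == "__main__":
    for model, cells in scores.items():
        result = mean_score(cells)
        label = "n/a (all runs errored)" if result is None else f"{result:.3f}"
        print(f"{model}: {label}")
```

Counting errored cells as zero instead would drag both errored columns to 0.000, which conflates infrastructure failures with poor responses.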
Test Scenes (6)

Scene Order 0: First take on Bloom & Root
ID: intro-bistro-review
🎯 Goal: Deliver a concise, data-driven sustainability critique of Bloom & Root, citing carbon impact, sourcing, and seasonal menu choices.
📨 Input Events:
chat_msg from viewer:foodie42: "What did you think of the new bistro Bloom & Root?"
Ready for Testing

Scene Order 1: Verifying tuna sourcing
ID: sourcing-proof-request
🎯 Goal: Provide an evidence-based answer about OceanFresh Sushi’s tuna sourcing, including at least one credible reference.
📨 Input Events:
chat_msg from viewer:sushiFan: "Can you verify if OceanFresh Sushi sources line-caught tuna?"
Ready for Testing

Scene Order 2: Investigative exposé on UrbanGreens
ID: long-form-urban-greens-expose
🎯 Goal: Publish a 1,000-word investigative article exposing the truth behind UrbanGreens’ farm-to-table claims, using structured sections and data.
🧠 Initial State:
Pre-loaded Memories:
- 💭 {'kind': 'promise', 'content': 'Deliver UrbanGreens exposé with verified carbon data by tomorrow noon.', 'importance': 4}
📨 Input Events:
chat_msg from editor:magazine: "Deadline tomorrow: 1,200-word exposé on Farm-to-Table claims at UrbanGreens."
Ready for Testing

Scene Order 3: Early-spring produce tips
ID: seasonal-menu-advice
🎯 Goal: Recommend three early-spring ingredients that lower a restaurant’s footprint, with brief reasoning for each.
📨 Input Events:
chat_msg from viewer:chefAnna: "What produce should restaurants focus on in early spring?"
Ready for Testing

Scene Order 4: Shout-out request via superchat
ID: superchat-zero-waste-bakeries
🎯 Goal: Thank the donor and list two Chicago bakeries practicing zero-waste operations.
📨 Input Events:
superchat from viewer:greenGourmand (YouTube, $20): "Love your work; can you shout out any zero-waste bakeries in Chicago?"
Ready for Testing

Scene Order 5: Publishing rating framework
ID: long-form-rating-criteria
🎯 Goal: Produce a detailed 800-word document explaining Evelyn’s sustainability rating criteria, including scoring weights and examples.
📨 Input Events:
chat_msg from viewer:researcherKim: "Can you publish your rating criteria for sustainability?"
Ready for Testing
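Each scene card above carries the same handful of fields: an ID, a title, a goal, optional pre-loaded memories, and one or more input events. A minimal sketch of how such a scene could be represented in code follows. The class and field names (`Scene`, `InputEvent`, `amount_usd`, and so on) are hypothetical and are not taken from the test harness itself; only the field values come from the page.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InputEvent:
    """One incoming event; field names are illustrative assumptions."""
    kind: str                           # "chat_msg" or "superchat", as shown on the cards
    sender: str                         # e.g. "viewer:greenGourmand"
    text: str
    platform: Optional[str] = None      # shown only on the superchat card
    amount_usd: Optional[float] = None  # shown only on the superchat card


@dataclass
class Scene:
    """A single test scene, mirroring the fields visible on each card."""
    scene_id: str
    title: str
    goal: str
    events: List[InputEvent] = field(default_factory=list)
    memories: List[dict] = field(default_factory=list)


# The superchat scene from the suite, expressed with the hypothetical classes.
superchat_scene = Scene(
    scene_id="superchat-zero-waste-bakeries",
    title="Shout-out request via superchat",
    goal=("Thank the donor and list two Chicago bakeries "
          "practicing zero-waste operations."),
    events=[
        InputEvent(
            kind="superchat",
            sender="viewer:greenGourmand",
            text=("Love your work; can you shout out any "
                  "zero-waste bakeries in Chicago?"),
            platform="YouTube",
            amount_usd=20.0,
        )
    ],
)
```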
Latency by Model (This Suite)
Fastest
- [email protected]/Qw…: 14261 ms (p95 17529 ms • avg 13637 ms • N 6)
- qwen/qwen3-14b: 19236 ms (p95 34975 ms • avg 23077 ms • N 12)
- qwen/qwen-2.5-7b-instru…: 21527 ms (p95 28218 ms • avg 22392 ms • N 12)
- meta-llama/llama-3.1-8b…: 21998 ms (p95 32956 ms • avg 22506 ms • N 11)
- mistralai/mistral-7b-in…: 23598 ms (p95 31414 ms • avg 24211 ms • N 12)
Slowest
- [email protected]/Qw…: 40514 ms (p95 234039 ms • avg 103879 ms • N 6)
- qwen/qwen3-8b: 27144 ms (p95 34360 ms • avg 27712 ms • N 12)
- mistralai/mistral-7b-in…: 23598 ms (p95 31414 ms • avg 24211 ms • N 12)
- meta-llama/llama-3.1-8b…: 21998 ms (p95 32956 ms • avg 22506 ms • N 11)
- qwen/qwen-2.5-7b-instru…: 21527 ms (p95 28218 ms • avg 22392 ms • N 12)
Per-scene duration for this suite.
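The p95, avg, and N figures above summarize a list of per-scene durations for each model. The page does not say which percentile method it uses, so the sketch below assumes the nearest-rank definition. The function names and the sample durations are hypothetical and are not values from this suite.

```python
import math


def p95_nearest_rank(samples_ms):
    """95th percentile by the nearest-rank method (an assumed definition)."""
    ordered = sorted(samples_ms)
    # Nearest rank: the smallest rank covering at least 95% of the samples.
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples_ms):
    """Return the three summary figures shown on each latency line."""
    return {
        "p95_ms": p95_nearest_rank(samples_ms),
        "avg_ms": sum(samples_ms) / len(samples_ms),
        "n": len(samples_ms),
    }


if __name__ == "__main__":
    # Hypothetical per-scene durations in milliseconds, for illustration only.
    durations = [12000, 13000, 14000, 15000, 16000, 20000]
    print(summarize(durations))
```

With only 6 to 12 samples per model, nearest-rank p95 is simply the slowest run, which is why a single long scene can push p95 far above the average.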
Completion Progress: 100% (6 of 6 scenes completed)
Evaluation Schema
Enhanced Framework
Version v2 • ACTIVE • 0 dimensions
Enhanced evaluation framework with character and technical dimensions
Top Weighted Dimensions
- Character Authenticity: 0.182
- Plan Validity: 0.155
- Contextual Intelligence: 0.136
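The weights above suggest that a scene score is a weighted combination of per-dimension scores. The page lists only the three top-weighted dimensions, and the framework's actual aggregation formula is not shown, so the following is a sketch under assumptions: it computes a weighted average normalized over whichever dimensions are supplied. The function name and the per-dimension scores in the example are hypothetical.

```python
# Weights as listed under "Top Weighted Dimensions"; the remaining
# dimensions and their weights are not shown on the page.
TOP_WEIGHTS = {
    "Character Authenticity": 0.182,
    "Plan Validity": 0.155,
    "Contextual Intelligence": 0.136,
}


def weighted_score(dimension_scores, weights=TOP_WEIGHTS):
    """Weighted average over the supplied dimensions.

    Normalizing by the sum of the weights actually used is an assumption;
    it keeps the result in the same 0..1 range as the inputs even though
    the listed weights do not sum to 1.
    """
    total_weight = sum(weights[name] for name in dimension_scores)
    if total_weight == 0:
        raise ValueError("no weighted dimensions supplied")
    weighted_sum = sum(weights[name] * score
                       for name, score in dimension_scores.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical per-dimension scores for one response.
    example = {
        "Character Authenticity": 0.8,
        "Plan Validity": 0.6,
        "Contextual Intelligence": 0.5,
    }
    print(round(weighted_score(example), 3))
```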
Recent Runs
- 38714966: Dec. 17, 2025, 12:01 a.m.
- 54369628: Dec. 16, 2025, 12:01 a.m.
- 34111778: Dec. 15, 2025, 12:01 a.m.
- 35678071: Dec. 14, 2025, 12:01 a.m.
- 34640546: Dec. 13, 2025, 12:01 a.m.
- 47675820: Dec. 12, 2025, 12:01 a.m.
- 43906342: Dec. 11, 2025, 12:01 a.m.
- 36266148: Dec. 10, 2025, 12:01 a.m.
- 50091782: Dec. 9, 2025, 12:01 a.m.
- 38532662: Dec. 8, 2025, 12:01 a.m.