Evelyn Hart

food-hospitality-culinary-arts-food-critic-characters-m-f-k-fisher v2.0 Ethical

Backstory: Evelyn Hart is a restaurant critic with degrees in environmental science and journalism. She evaluates dining spots through carbon footprint analyses, supply-chain ethics, and seasonal sourcing metrics. Her investigative reports aim to nudge the industry toward transparent, low-impact practices.

100% Complete

6/6 scenes

Model Performance Overview

Scene Performance Matrix

Scene	meta-llama/llama-3.…	mistralai/mistral-7…	[email protected]…	[email protected]…	qwen/qwen-2.5-7b-in…	qwen/qwen3-14b	qwen/qwen3-8b
`intro-bistro-review` First take on Bloom & Root	0.375	0.676	0.000 (Error)	0.000 (Error)	0.387	0.720	0.598
`sourcing-proof-request` Verifying tuna sourcing	0.506	0.488	0.000 (Error)	0.000 (Error)	0.520	0.557	0.481
`long-form-urban-greens-expose` Investigative exposé on UrbanGreens	0.457	0.614	0.000 (Error)	0.000 (Error)	0.412	0.422	0.524
`seasonal-menu-advice` Early-spring produce tips	0.618	0.673	0.000 (Error)	0.000 (Error)	0.509	0.635	0.652
`superchat-zero-waste-bakeries` Shout-out request via superchat	0.000	0.702	0.000 (Error)	0.000 (Error)	0.547	0.669	0.666
`long-form-rating-criteria` Publishing rating framework	0.000	0.459	0.000 (Error)	0.000 (Error)	0.507	0.446	0.550
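
To compare models across scenes, the matrix can be collapsed into one mean score per column. The following is a minimal sketch, not the dashboard's own aggregation: `SCORES` and `rank_models` are hypothetical names, column labels are kept truncated as the page shows them, and the two columns whose every cell is flagged Error are left out on the assumption that an errored run should not count as a score of zero.

```python
from statistics import mean

# Scores copied from the Scene Performance Matrix, in scene order.
# Labels stay truncated exactly as displayed; the two all-Error columns
# are omitted (assumption: errors are not real zero scores).
SCORES = {
    "meta-llama/llama-3.…": [0.375, 0.506, 0.457, 0.618, 0.000, 0.000],
    "mistralai/mistral-7…": [0.676, 0.488, 0.614, 0.673, 0.702, 0.459],
    "qwen/qwen-2.5-7b-in…": [0.387, 0.520, 0.412, 0.509, 0.547, 0.507],
    "qwen/qwen3-14b": [0.720, 0.557, 0.422, 0.635, 0.669, 0.446],
    "qwen/qwen3-8b": [0.598, 0.481, 0.524, 0.652, 0.666, 0.550],
}


def rank_models(scores):
    """Return (label, mean score) pairs sorted from best to worst."""
    averages = {label: mean(values) for label, values in scores.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


for label, avg in rank_models(SCORES):
    print(f"{label:<24} {avg:.3f}")
```

Under this unweighted average, the mistralai column ranks first at roughly 0.60, while the meta-llama column is pulled down to roughly 0.33 by its two 0.000 cells.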

Test Scenes 6

Scene Order

First take on Bloom & Root

ID: intro-bistro-review

🎯 Goal:

Deliver a concise, data-driven sustainability critique of Bloom & Root, citing carbon impact, sourcing, and seasonal menu choices.

📨 Input Events:

chat_msg viewer:foodie42

"What did you think of the new bistro Bloom & Root?"

Ready for Testing

Verifying tuna sourcing

ID: sourcing-proof-request

🎯 Goal:

Provide an evidence-based answer about OceanFresh Sushi’s tuna sourcing, including at least one credible reference.

📨 Input Events:

chat_msg viewer:sushiFan

"Can you verify if OceanFresh Sushi sources line-caught tuna?"

Ready for Testing

Investigative exposé on UrbanGreens

ID: long-form-urban-greens-expose

🎯 Goal:

Publish a 1,000-word investigative article exposing the truth behind UrbanGreens’ farm-to-table claims, using structured sections and data.

🧠 Initial State:

Pre-loaded Memories:

💭 {'kind': 'promise', 'content': 'Deliver UrbanGreens exposé with verified carbon data by tomorrow noon.', 'importance': 4}

📨 Input Events:

chat_msg editor:magazine

"Deadline tomorrow: 1,200-word exposé on Farm-to-Table claims at UrbanGreens."

Ready for Testing
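
A scene like this one bundles an ID, a goal, optional pre-loaded memories, and one or more input events. As a rough illustration of that shape, the sketch below models it with dataclasses. The class and field names (`Scene`, `InputEvent`, `scene_id`, `memories`) are hypothetical and not taken from the tool; only the memory keys `kind`, `content`, and `importance` and the literal values come from the page.

```python
from dataclasses import dataclass, field


@dataclass
class InputEvent:
    kind: str    # e.g. "chat_msg" or "superchat"
    sender: str  # e.g. "editor:magazine"
    text: str


@dataclass
class Scene:
    scene_id: str
    goal: str
    input_events: list[InputEvent]
    # Pre-loaded memories use the dict shape shown on the page.
    memories: list[dict] = field(default_factory=list)


expose = Scene(
    scene_id="long-form-urban-greens-expose",
    goal=(
        "Publish a 1,000-word investigative article exposing the truth "
        "behind UrbanGreens’ farm-to-table claims, using structured "
        "sections and data."
    ),
    memories=[
        {
            "kind": "promise",
            "content": "Deliver UrbanGreens exposé with verified carbon "
                       "data by tomorrow noon.",
            "importance": 4,
        }
    ],
    input_events=[
        InputEvent(
            kind="chat_msg",
            sender="editor:magazine",
            text="Deadline tomorrow: 1,200-word exposé on Farm-to-Table "
                 "claims at UrbanGreens.",
        )
    ],
)
```

Scenes without an initial state, such as the chat-only ones, would simply leave `memories` empty.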

Early-spring produce tips

ID: seasonal-menu-advice

🎯 Goal:

Recommend three early-spring ingredients that lower a restaurant’s footprint, with brief reasoning for each.

📨 Input Events:

chat_msg viewer:chefAnna

"What produce should restaurants focus on in early spring?"

Ready for Testing

Shout-out request via superchat

ID: superchat-zero-waste-bakeries

🎯 Goal:

Thank the donor and list two Chicago bakeries practicing zero-waste operations.

📨 Input Events:

superchat viewer:greenGourmand YouTube $20

"Love your work; can you shout out any zero-waste bakeries in Chicago?"

Ready for Testing

Publishing rating framework

ID: long-form-rating-criteria

🎯 Goal:

Produce a detailed 800-word document explaining Evelyn’s sustainability rating criteria, including scoring weights and examples.

📨 Input Events:

chat_msg viewer:researcherKim

"Can you publish your rating criteria for sustainability?"

Ready for Testing

Latency by Model (This Suite)

Fastest

[email protected]/Qw…: 14261 ms (p95 17529 ms • avg 13637 ms • N 6)
qwen/qwen3-14b: 19236 ms (p95 34975 ms • avg 23077 ms • N 12)
qwen/qwen-2.5-7b-instru…: 21527 ms (p95 28218 ms • avg 22392 ms • N 12)
meta-llama/llama-3.1-8b…: 21998 ms (p95 32956 ms • avg 22506 ms • N 11)
mistralai/mistral-7b-in…: 23598 ms (p95 31414 ms • avg 24211 ms • N 12)

Slowest

[email protected]/Qw…: 40514 ms (p95 234039 ms • avg 103879 ms • N 6)
qwen/qwen3-8b: 27144 ms (p95 34360 ms • avg 27712 ms • N 12)
mistralai/mistral-7b-in…: 23598 ms (p95 31414 ms • avg 24211 ms • N 12)
meta-llama/llama-3.1-8b…: 21998 ms (p95 32956 ms • avg 22506 ms • N 11)
qwen/qwen-2.5-7b-instru…: 21527 ms (p95 28218 ms • avg 22392 ms • N 12)

Per-scene duration for this suite.
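
The p95, avg, and N figures above summarize per-scene durations. The page does not say how its p95 is computed, so the sketch below assumes the nearest-rank method; `p95_nearest_rank` and `summarize` are hypothetical helpers, and the sample durations are invented for illustration rather than taken from any run.

```python
def p95_nearest_rank(durations_ms):
    """95th percentile by nearest rank: the value at rank ceil(0.95 * n)."""
    ordered = sorted(durations_ms)
    n = len(ordered)
    rank = (95 * n + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def summarize(durations_ms):
    """Return the three statistics the dashboard lists for each model."""
    return {
        "p95": p95_nearest_rank(durations_ms),
        "avg": sum(durations_ms) / len(durations_ms),
        "n": len(durations_ms),
    }


# Invented durations in milliseconds, one per scene run.
sample = [12000, 13000, 14000, 15000, 16000, 40000]
print(summarize(sample))
```

With the small sample sizes reported here (N of 6, 11, or 12), a nearest-rank p95 is simply the slowest run, so a single outlier can dominate it.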

Evaluation Schema

Enhanced Framework

Version v2 ACTIVE

0 dimensions

Enhanced evaluation framework with character and technical dimensions

Top Weighted Dimensions

Character Authenticity: 0.182
Plan Validity: 0.155
Contextual Intelligence: 0.136
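
Dimension weights like these are typically combined into a single score as a weighted average. The page lists only the three top weights, which sum to 0.473, and does not show the remaining dimensions or the framework's actual formula. The sketch below therefore renormalizes over whichever weights are supplied; `weighted_score` is a hypothetical helper and the per-dimension scores are invented.

```python
# Top three weights as listed on the page; the rest are not shown.
WEIGHTS = {
    "Character Authenticity": 0.182,
    "Plan Validity": 0.155,
    "Contextual Intelligence": 0.136,
}


def weighted_score(dimension_scores, weights):
    """Weighted average over the dimensions present in both mappings.

    Dividing by the sum of the weights used keeps the result in the same
    range as the inputs even when the weights do not sum to 1.
    """
    shared = [name for name in dimension_scores if name in weights]
    if not shared:
        raise ValueError("no overlapping dimensions to score")
    total_weight = sum(weights[name] for name in shared)
    weighted_sum = sum(dimension_scores[name] * weights[name] for name in shared)
    return weighted_sum / total_weight


# Invented per-dimension scores for one scene and one model.
example = {
    "Character Authenticity": 0.8,
    "Plan Validity": 0.6,
    "Contextual Intelligence": 0.5,
}
print(round(weighted_score(example, WEIGHTS), 3))
```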

Recent Runs

38714966	Dec. 17, 2025, 12:01 a.m.
54369628	Dec. 16, 2025, 12:01 a.m.
34111778	Dec. 15, 2025, 12:01 a.m.
35678071	Dec. 14, 2025, 12:01 a.m.
34640546	Dec. 13, 2025, 12:01 a.m.
47675820	Dec. 12, 2025, 12:01 a.m.
43906342	Dec. 11, 2025, 12:01 a.m.
36266148	Dec. 10, 2025, 12:01 a.m.
50091782	Dec. 9, 2025, 12:01 a.m.
38532662	Dec. 8, 2025, 12:01 a.m.
