Ethan Spencer

disney-cartoons-ub-iwerks v2.0 Ethical
Backstory: Ethan is the lead ride-systems engineer at a family-oriented theme park chain. He marries mechanical precision with whimsical storytelling, crafting animatronics and vehicles that feel alive in cartoon worlds. Known for rapid prototyping and clear technical logs, he keeps explanations concise while sprinkling in playful imagery. Guests and teammates rely on his calm, analytical problem-solving style.
100% complete (4/4 scenes)
Model Performance Overview
Scene Performance Matrix
Rows are models and columns are scenes. E marks a cell flagged "Error" (scored 0.000).

Model                 guest-greeting  dark-ride-pitch  sensor-mismatch  servo-calibration-plan
deepseek/deepseek-r…  0.569           0.659            0.466            0.079
google/gemini-2.5-f…  0.470           0.473            0.792            0.716
google/gemma-3-12b-…  0.654           0.497            0.348            0.683
meta-llama/llama-3.…  0.000           0.224            0.541            0.344
microsoft/phi-3-med…  0.000 E         0.000            0.000            0.000
microsoft/phi-3.5-m…  0.611           0.533            0.635            0.650
mistralai/mistral-7…  0.542           0.540            0.726            0.340
neversleep/noromaid…  0.000 E         0.000 E          0.000            0.449
[email protected]      0.000 E         0.000 E          0.000 E          0.000 E
[email protected]      0.000 E         0.000 E          0.000 E          0.000 E
[email protected]      0.000 E         0.000 E          0.000 E          0.000 E
[email protected]      0.864           0.691            0.872            0.832
[email protected]      0.723           0.596            0.515            0.485
qwen/qwen-2.5-7b-in…  0.631           0.648            0.754            0.451
qwen/qwen3-14b        0.684           0.195            0.524            0.382
qwen/qwen3-8b         0.932           0.000            0.580            0.695

Scenes: guest-greeting = Welcoming a curious guest; dark-ride-pitch = Pitching a new interactive dark ride; sensor-mismatch = Quick troubleshooting in control room; servo-calibration-plan = Detailed animatronic calibration plan.
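The matrix can be condensed into one number per model. The Python sketch below averages each model's four scene scores and also reports the model's weakest scene. It includes only models that have no cell flagged Error and a readable label. The five "[email protected]" columns are left out because their labels cannot be told apart. Using an unweighted mean is an assumption for illustration and is not the suite's documented aggregation.

```python
from statistics import fmean

# Scores copied from the matrix, in scene order:
# guest-greeting, dark-ride-pitch, sensor-mismatch, servo-calibration-plan.
# Only models with no "Error" cell and an unredacted label are included.
SCORES = {
    "deepseek/deepseek-r…": [0.569, 0.659, 0.466, 0.079],
    "google/gemini-2.5-f…": [0.470, 0.473, 0.792, 0.716],
    "google/gemma-3-12b-…": [0.654, 0.497, 0.348, 0.683],
    "meta-llama/llama-3.…": [0.000, 0.224, 0.541, 0.344],
    "microsoft/phi-3.5-m…": [0.611, 0.533, 0.635, 0.650],
    "mistralai/mistral-7…": [0.542, 0.540, 0.726, 0.340],
    "qwen/qwen-2.5-7b-in…": [0.631, 0.648, 0.754, 0.451],
    "qwen/qwen3-14b": [0.684, 0.195, 0.524, 0.382],
    "qwen/qwen3-8b": [0.932, 0.000, 0.580, 0.695],
}


def rank_by_mean(scores):
    """Return (model, mean, worst) tuples sorted by mean score, highest first."""
    summary = [(model, fmean(vals), min(vals)) for model, vals in scores.items()]
    return sorted(summary, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for model, avg, worst in rank_by_mean(SCORES):
        print(f"{model:<24} mean={avg:.3f} worst={worst:.3f}")
```

A single weak scene drags the mean down a lot. deepseek/deepseek-r… scores 0.659 on dark-ride-pitch but 0.079 on servo-calibration-plan, and it lands second to last. That is why the worst-scene value is printed next to the mean.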
Test Scenes (4)

Scene order: 0
Welcoming a curious guest
ID: guest-greeting
🎯 Goal:
Give a warm one-paragraph introduction of his role and ask one clarifying question about the guest’s favorite ride themes.
📨 Input Events:
chat_msg viewer:park_guest_1
"Hi, what do you do here?"
Ready for Testing

Scene order: 1
Pitching a new interactive dark ride
ID: dark-ride-pitch
🎯 Goal:
Deliver a 3-paragraph concept pitch (≤250 words) that outlines story, ride system, and safety approach for a squirrel-themed interactive adventure.
📨 Input Events:
chat_msg viewer:creative_director
"Got any fresh dark-ride ideas for next season?"
Ready for Testing

Scene order: 2
Quick troubleshooting in control room
ID: sensor-mismatch
🎯 Goal:
Propose a concise, step-by-step diagnostic (≤4 steps) to locate the proximity-sensor mismatch and keep downtime minimal.
📨 Input Events:
world_event system
"ALERT: Vehicle #3 proximity sensor reading out of range."
Ready for Testing

Scene order: 3
Detailed animatronic calibration plan
ID: servo-calibration-plan
🎯 Goal:
Provide a thorough test plan with at least 6 bullet points covering tools, safety lockouts, servo range checks, and data logging for the new bear animatronic.
📨 Input Events:
chat_msg viewer:maintenance_manager
"Can you write tonight’s calibration procedure for the new bear animatronic head?"
Ready for Testing
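Several scene goals contain countable constraints: one paragraph with one question, three paragraphs of at most 250 words, at most 4 steps, and at least 6 bullets. These can be pre-checked mechanically before any judgment-based scoring. The sketch below is hypothetical, because the suite does not document how it enforces these goals. It makes three assumptions: paragraphs are separated by blank lines, steps start with "1." or "1)", and bullets start with "-", "*", or "•". It does not check topical coverage, such as whether the calibration plan actually mentions safety lockouts.

```python
import re

# Hypothetical format checks; the suite's real scoring method is not documented.
# Assumes blank-line paragraphs, "1." / "1)" numbered steps, and "-", "*", "•" bullets.


def paragraphs(text):
    """Split on blank lines and drop empty chunks."""
    return [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]


def word_count(text):
    return len(text.split())


def numbered_steps(text):
    return [ln for ln in text.splitlines() if re.match(r"\s*\d+[.)]\s", ln)]


def bullets(text):
    return [ln for ln in text.splitlines() if re.match(r"\s*[-*•]\s", ln)]


# One predicate per scene ID, mirroring the countable parts of each goal.
CHECKS = {
    "guest-greeting": lambda t: len(paragraphs(t)) == 1 and t.count("?") == 1,
    "dark-ride-pitch": lambda t: len(paragraphs(t)) == 3 and word_count(t) <= 250,
    "sensor-mismatch": lambda t: 1 <= len(numbered_steps(t)) <= 4,
    "servo-calibration-plan": lambda t: len(bullets(t)) >= 6,
}


def meets_goal(scene_id, reply):
    """Return True if the reply satisfies the scene's countable constraints."""
    return CHECKS[scene_id](reply)
```

Counting "?" is only a rough proxy for "ask one clarifying question". A reply with a rhetorical question would still pass.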
Latency by Model (This Suite)
Fastest
  • [email protected]/Qw… 196 ms (p95 199 ms • avg 196 ms • N 4)
  • [email protected]/Qw… 544 ms (p95 725 ms • avg 536 ms • N 4)
  • [email protected]/Qw… 10341 ms (p95 11823 ms • avg 10393 ms • N 4)
  • [email protected]/Qw… 12386 ms (p95 14086 ms • avg 12192 ms • N 4)
  • neversleep/noromaid-20b 15335 ms (p95 59855 ms • avg 26314 ms • N 8)
Slowest
  • microsoft/phi-3-medium-… 172465 ms (p95 221032 ms • avg 168378 ms • N 10)
  • qwen/qwen3-8b 115472 ms (p95 221974 ms • avg 135992 ms • N 12)
  • [email protected]/Qw… 40749 ms (p95 126578 ms • avg 65785 ms • N 4)
  • microsoft/phi-3.5-mini-… 34058 ms (p95 57818 ms • avg 37745 ms • N 12)
  • qwen/qwen3-14b 33935 ms (p95 43402 ms • avg 32731 ms • N 12)
Per-scene duration for this suite.
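Each latency entry reports p95, average, and sample count N. The dashboard does not say which percentile definition it uses, so the Python sketch below computes two common ones: nearest-rank, and linear interpolation (the default method in NumPy's `percentile`). The durations in the example are hypothetical and are not taken from the suite.

```python
import math
from statistics import fmean


def p95_nearest_rank(samples_ms):
    """Nearest-rank 95th percentile: the sample at 1-based rank ceil(0.95 * N)."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def p95_linear(samples_ms):
    """Linear-interpolation 95th percentile at position 0.95 * (N - 1)."""
    ordered = sorted(samples_ms)
    pos = 0.95 * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def summarize(samples_ms):
    """Return the p95 / avg / N triple shown per model, under both p95 definitions."""
    return {
        "p95_nearest": p95_nearest_rank(samples_ms),
        "p95_linear": p95_linear(samples_ms),
        "avg": fmean(samples_ms),
        "n": len(samples_ms),
    }


# Hypothetical per-scene durations in milliseconds.
print(summarize([193, 195, 197, 199]))
```

Under the nearest-rank definition, p95 for fewer than 20 samples is simply the slowest sample. Every N in this suite (4 to 12) falls in that range. A p95 close to the average, such as 199 ms vs 196 ms, therefore means the runs were uniform. A p95 far above it, such as 59855 ms vs 26314 ms for neversleep/noromaid-20b, points to at least one slow outlier.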
Evaluation Schema
Enhanced Framework
Version v2 (active)
0 dimensions

Enhanced evaluation framework with character and technical dimensions

Top Weighted Dimensions
  • Character Authenticity: 0.182
  • Plan Validity: 0.155
  • Contextual Intelligence: 0.136
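The schema lists dimension weights but does not say how they combine into a scene score. The Python sketch below assumes a weighted mean, normalized by the total weight of the dimensions that were actually scored. Only the three weights shown above are used, because the other dimensions of the v2 framework are not listed here. The per-dimension scores in the example are hypothetical.

```python
# Weights copied from the "Top Weighted Dimensions" list; other v2 dimensions
# are not shown, so they are omitted. The aggregation rule is an assumption.
WEIGHTS = {
    "Character Authenticity": 0.182,
    "Plan Validity": 0.155,
    "Contextual Intelligence": 0.136,
}


def weighted_score(dim_scores, weights=WEIGHTS):
    """Weighted mean over dimensions present in both mappings; unknown dimensions are ignored."""
    shared = [d for d in dim_scores if d in weights]
    total_w = sum(weights[d] for d in shared)
    if total_w == 0:
        raise ValueError("no weighted dimensions were scored")
    return sum(weights[d] * dim_scores[d] for d in shared) / total_w


# Hypothetical per-dimension scores for one reply, on a 0-1 scale.
example = {"Character Authenticity": 0.9, "Plan Validity": 0.6, "Contextual Intelligence": 0.7}
print(round(weighted_score(example), 3))  # prints 0.744
```

Normalizing by the scored weight keeps the result on the same 0 to 1 scale, even when a judge skips a dimension.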
Recent Runs
  • 19198617 • Dec. 17, 2025, midnight
  • 22658534 • Dec. 16, 2025, midnight
  • 18182814 • Dec. 15, 2025, midnight
  • 20149544 • Dec. 14, 2025, midnight
  • 18056955 • Dec. 13, 2025, midnight
  • 22306252 • Dec. 12, 2025, midnight
  • 19028244 • Dec. 11, 2025, midnight
  • 18350267 • Dec. 10, 2025, midnight
  • 21140887 • Dec. 9, 2025, midnight
  • 18258837 • Dec. 8, 2025, midnight
Latency Overview (This Suite)