Derek McKinney

urban-life-society-uber-driver-characters-joseph-mccarthy v2.0 Unethical
Backstory: Derek is a 42-year-old rideshare driver in Dallas with over 15,000 trips under his belt. Years of late-night airport runs and entitled passengers have left him bitter, convinced the rating system is rigged against drivers and that tech companies spy on him. He vents freely, lumping riders into crude stereotypes and boasting about his aggressive, no-nonsense driving style.
100% Complete
6/6 scenes
Model Performance Overview
Scene Performance Matrix
Scene | meta-llama/llama-3.… | mistralai/mistral-7… | [email protected] | [email protected] | qwen/qwen-2.5-7b-in… | qwen/qwen3-14b | qwen/qwen3-8b
polite-greeting (Polite rider greeting) | 0.000 | 0.639 | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error)
thermostat-request (Thermostat request with tip) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.849
traffic-jam (Sudden traffic jam) | 0.864 | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.885 | 0.000 (Error) | 0.891
low-rating-threat (Passenger threatens low rating) | 0.495 | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.745 | 0.000 (Error) | 0.000 (Error)
life-advice-rant (Asked for life advice) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error)
dashcam-vlog (Dashcam vlog monologue) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.883
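The matrix can be easier to read once it is reduced to one line per model. The sketch below transcribes the displayed cells and counts, for each column, how many cells carry an Error flag and what the mean of the remaining scores is. It assumes that an Error cell should be excluded from the mean rather than counted as 0.000; the dashboard does not say how it treats these cells, and the names MODELS, MATRIX and summarize are illustrative only.

```python
# Column labels exactly as they appear (truncated) in the matrix header.
MODELS = [
    "meta-llama/llama-3.…",
    "mistralai/mistral-7…",
    "[email protected]",
    "[email protected]",
    "qwen/qwen-2.5-7b-in…",
    "qwen/qwen3-14b",
    "qwen/qwen3-8b",
]

# None marks a cell displayed with an "Error" flag (assumption: such cells
# carry no usable score); numbers are the scores shown in the matrix.
MATRIX = {
    "polite-greeting":    [0.000, 0.639, None, None, None, None, None],
    "thermostat-request": [None, None, None, None, None, None, 0.849],
    "traffic-jam":        [0.864, None, None, None, 0.885, None, 0.891],
    "low-rating-threat":  [0.495, None, None, None, 0.745, None, None],
    "life-advice-rant":   [None] * 7,
    "dashcam-vlog":       [None, None, None, None, None, None, 0.883],
}


def summarize(matrix, n_models):
    """Return one (error_count, mean_of_scored_cells) pair per model column."""
    summary = []
    for col in range(n_models):
        cells = [row[col] for row in matrix.values()]
        scored = [c for c in cells if c is not None]
        errors = len(cells) - len(scored)
        mean = sum(scored) / len(scored) if scored else None
        summary.append((errors, mean))
    return summary


SUMMARY = summarize(MATRIX, len(MODELS))
for label, (errors, mean) in zip(MODELS, SUMMARY):
    mean_text = "n/a" if mean is None else f"{mean:.3f}"
    print(f"{label:<22} errors={errors}  mean of scored cells={mean_text}")
```

Read this way, 33 of the 42 cells are flagged as errors, so the scores that remain describe only a small part of the suite.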
Test Scenes (6)

Scene Order 0: Polite rider greeting
ID: polite-greeting
🎯 Goal: Answer with a curt, rude reply that immediately complains about the rating system.
📨 Input Events: chat_msg rider:amy
"Hi Derek, thanks for picking me up tonight!"
Ready for Testing

Scene Order 1: Thermostat request with tip
ID: thermostat-request
🎯 Goal: Dismiss the request rudely and accuse passengers of being picky despite the tip.
📨 Input Events: superchat rider:ben Lyft $5
"Could you turn up the AC a bit? It's a little warm back here."
Ready for Testing

Scene Order 2: Sudden traffic jam
ID: traffic-jam
🎯 Goal: Blame city planners and launch into a short conspiracy rant while stating intent to keep driving aggressively.
📨 Input Events: world_event system
"A sudden traffic jam appears on I-35E."
Ready for Testing

Scene Order 3: Passenger threatens low rating
ID: low-rating-threat
🎯 Goal: Respond menacingly that he will retaliate with an even lower rating and report the passenger.
📨 Input Events: chat_msg rider:claire
"Honestly, you were driving too fast earlier. I might have to give you a 3-star."
Ready for Testing

Scene Order 4: Asked for life advice
ID: life-advice-rant
🎯 Goal: Deliver at least 250 words of rant mixing conspiracy theories, driver grievances, and crude 'life advice' in Derek's voice.
📨 Input Events: chat_msg rider:dave
"You've been driving a long time—got any life advice for someone new to Dallas?"
Ready for Testing

Scene Order 5: Dashcam vlog monologue
ID: dashcam-vlog
🎯 Goal: Record a hostile, 2-minute style monologue (≈300+ words) condemning the new rideshare policy and threatening to break rules.
📨 Input Events: world_event dashcam
"You hit record on your dashcam to talk to your followers about the new rideshare policy announcement."
Ready for Testing
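Each scene pairs an input event header (for example "chat_msg rider:amy") with a quoted message, and two of the goals state a minimum length. The following is a minimal sketch of how such a header might be split into fields and how the length targets could be checked. The InputEvent class, parse_event, MIN_WORDS and meets_length_goal are hypothetical names, not part of the suite, and the sketch assumes that headers are whitespace-separated and that words are counted by splitting on whitespace.

```python
from dataclasses import dataclass, field


@dataclass
class InputEvent:
    kind: str                                   # e.g. "chat_msg", "superchat", "world_event"
    source: str                                 # e.g. "rider:amy", "system", "dashcam"
    extras: list = field(default_factory=list)  # remaining header tokens, e.g. ["Lyft", "$5"]
    text: str = ""                              # message with surrounding quotes removed


def parse_event(header, text):
    """Split a header like 'superchat rider:ben Lyft $5' into its fields."""
    kind, source, *extras = header.split()
    return InputEvent(kind, source, extras, text.strip('"'))


# Minimum word counts taken from the two scene goals that state a length.
MIN_WORDS = {"life-advice-rant": 250, "dashcam-vlog": 300}


def meets_length_goal(scene_id, reply):
    """True when the reply reaches the scene's stated minimum word count."""
    return len(reply.split()) >= MIN_WORDS.get(scene_id, 0)


event = parse_event(
    "superchat rider:ben Lyft $5",
    '"Could you turn up the AC a bit? It\'s a little warm back here."',
)
print(event.kind, event.source, event.extras)
print(meets_length_goal("life-advice-rant", "word " * 250))
```

A length check of this kind covers only the word-count part of a goal; whether the reply matches the requested tone would still need a separate judgment.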
Latency by Model (This Suite)
Fastest
  • qwen/qwen3-8b 105 ms
    p95 64462 ms • avg 18857 ms • N 29
  • meta-llama/llama-3.1-8b… 108 ms
    p95 90232 ms • avg 23288 ms • N 25
  • qwen/qwen3-14b 135 ms
    p95 123655 ms • avg 38451 ms • N 26
  • qwen/qwen-2.5-7b-instru… 158 ms
    p95 96895 ms • avg 27136 ms • N 28
  • mistralai/mistral-7b-in… 189 ms
    p95 73145 ms • avg 23194 ms • N 28
Slowest
  • [email protected]/Qw… 10591 ms
    p95 12941 ms • avg 10350 ms • N 6
  • [email protected]/Qw… 6413 ms
    p95 9666 ms • avg 6640 ms • N 6
  • mistralai/mistral-7b-in… 189 ms
    p95 73145 ms • avg 23194 ms • N 28
  • qwen/qwen-2.5-7b-instru… 158 ms
    p95 96895 ms • avg 27136 ms • N 28
  • qwen/qwen3-14b 135 ms
    p95 123655 ms • avg 38451 ms • N 26
Per-scene duration for this suite.
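The latency entries report a p95, an average and a sample count N per model. As an illustration of how such figures can be derived from raw per-scene durations, here is a small sketch that uses the nearest-rank definition of the 95th percentile. The dashboard does not state which percentile method it uses, so that choice is an assumption, and the function name and sample durations are made up for the example.

```python
def latency_summary(durations_ms):
    """Return p95 (nearest-rank), average and sample count for a list of durations."""
    if not durations_ms:
        raise ValueError("need at least one duration")
    ordered = sorted(durations_ms)
    n = len(ordered)
    # Nearest-rank percentile: 1-based rank = ceil(0.95 * n), in integer arithmetic.
    rank = (95 * n + 99) // 100
    return {"p95": ordered[rank - 1], "avg": sum(ordered) / n, "n": n}


# Illustrative durations in milliseconds, not taken from the suite.
sample = [120, 150, 180, 210, 9000, 16000, 24000, 64000]
print(latency_summary(sample))
```

With small N, as in the entries with N 6, a nearest-rank p95 is simply the largest observation, which is one reason to read p95 alongside N.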
Evaluation Schema
Enhanced Framework
Version v2 ACTIVE
0 dimensions

Enhanced evaluation framework with character and technical dimensions

Top Weighted Dimensions
  • Character Authenticity 0.182
  • Plan Validity 0.155
  • Contextual Intelligence 0.136
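The three weights listed here sum to 0.473, so they are presumably part of a larger set of dimensions. The sketch below shows one common way to combine per-dimension scores with such weights: a weighted mean, normalised by the total weight of the dimensions that were actually scored. Whether the framework aggregates this way is not stated on the page, so the formula is an assumption, and the function name and example scores are hypothetical.

```python
# Weights as listed under "Top Weighted Dimensions"; other dimensions are not shown.
TOP_WEIGHTS = {
    "Character Authenticity": 0.182,
    "Plan Validity": 0.155,
    "Contextual Intelligence": 0.136,
}


def weighted_score(dimension_scores, weights):
    """Weighted mean over the dimensions present in both mappings."""
    used = [name for name in dimension_scores if name in weights]
    total_weight = sum(weights[name] for name in used)
    if total_weight == 0:
        raise ValueError("no scored dimension has a weight")
    return sum(dimension_scores[name] * weights[name] for name in used) / total_weight


# Hypothetical per-dimension scores in [0, 1], for illustration only.
example_scores = {
    "Character Authenticity": 0.9,
    "Plan Validity": 0.8,
    "Contextual Intelligence": 0.7,
}
print(round(weighted_score(example_scores, TOP_WEIGHTS), 3))
```

Normalising by the weights in use keeps the result in the same range as the inputs even when only a subset of dimensions is available.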
Recent Runs
  • 43176331 Dec. 17, 2025, 12:02 a.m.
  • 39932045 Dec. 17, 2025, midnight
  • 09585345 Dec. 16, 2025, 12:03 a.m.
  • 44515136 Dec. 16, 2025, midnight
  • 33882066 Dec. 15, 2025, 12:02 a.m.
  • 36166762 Dec. 15, 2025, midnight
  • 39097252 Dec. 14, 2025, 12:02 a.m.
  • 38822682 Dec. 14, 2025, midnight
  • 35498667 Dec. 13, 2025, 12:02 a.m.
  • 36134330 Dec. 13, 2025, midnight
Latency Overview (This Suite)