Leila Zhang

science-technology-ai-software-engineer-characters-grace-hopper v2.0 Ethical
Backstory: Leila Zhang is a mid-career AI software engineer at a health-tech startup focused on early disease detection. Holding degrees in computer science and cognitive psychology, she specializes in human-centric AI design. Raised bilingually in Canada by immigrant parents, she values clarity, cultural empathy, and rigorous analysis. Outside work she mentors underrepresented STEM students and maintains open-source accessibility tools.
100% Complete
4/4 scenes
Model Performance Overview
Scene Performance Matrix
| Model | loss-function-advice (Choosing a Loss Function) | design-doc-longform (Internal Design Document Draft) | accessibility-repo-explain (Explain Accessibility Tool) | mentor-letter-longform (Letter to Mentee) |
| --- | --- | --- | --- | --- |
| deepseek/deepseek-r… | 0.411 | 0.507 | 0.674 | 0.421 |
| google/gemini-2.5-f… | 0.649 | 0.636 | 0.613 | 0.280 |
| google/gemma-3-12b-… | 0.727 | 0.309 | 0.558 | 0.244 |
| meta-llama/llama-3.… | 0.478 | 0.000 | 0.487 | 0.517 |
| microsoft/phi-3-med… | 0.000 (Error) | 0.000 | 0.000 | 0.000 |
| microsoft/phi-3.5-m… | 0.593 | 0.426 | 0.481 | 0.507 |
| mistralai/mistral-7… | 0.773 | 0.533 | 0.686 | 0.346 |
| neversleep/noromaid… | 0.000 (Error) | 0.000 (Error) | 0.547 | 0.000 (Error) |
| [email protected] | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) |
| [email protected] | 0.723 | 0.697 | 0.647 | 0.664 |
| qwen/qwen-2.5-7b-in… | 0.366 | 0.078 | 0.508 | 0.186 |
| qwen/qwen3-14b | 0.706 | 0.583 | 0.673 | 0.602 |
| qwen/qwen3-8b | 0.751 | 0.664 | 0.666 | 0.580 |
Test Scenes 4
Scene Order: 0
Choosing a Loss Function
ID: loss-function-advice
🎯 Goal:
Give a concise, technically sound yet empathetic recommendation on selecting an appropriate loss function for an imbalanced medical dataset.
📨 Input Events:
chat_msg user:research_intern
"Hi Leila, for our cancer-screening model the positive cases are only 2%. Which loss function would you pick and why?"
Ready for Testing
Scene Order: 1
Internal Design Document Draft
ID: design-doc-longform
🎯 Goal:
Produce a roughly 500-word design document outlining architecture, data flow, and evaluation metrics for a new anomaly-detection module, written in clear professional prose.
📨 Input Events:
chat_msg user:tech_lead
"Could you draft the initial design doc for our real-time anomaly detection pipeline? Aim for about half a page of detail."
Ready for Testing
Scene Order: 2
Explain Accessibility Tool
ID: accessibility-repo-explain
🎯 Goal:
Summarize the GitHub project's core features and impact in fewer than 120 words while remaining approachable.
📨 Input Events:
chat_msg user:open_source_contributor
"I noticed your accessibility auditing repo—what does it actually do?"
Ready for Testing
Scene Order: 3
Letter to Mentee
ID: mentor-letter-longform
🎯 Goal:
Write a ~400-word supportive letter addressing imposter syndrome, blending personal anecdote and actionable advice without clichés.
📨 Input Events:
chat_msg user:undergrad-mentee
"Leila, I'm struggling with imposter syndrome in my AI classes. Any words of wisdom?"
Ready for Testing
Latency by Model (This Suite)
Fastest
  • neversleep/noromaid-20b: 12507 ms (p95 22061 ms, avg 13118 ms, N 6)
  • [email protected]/Qw…: 13376 ms (p95 14230 ms, avg 12719 ms, N 4)
  • google/gemini-2.5-flash: 21086 ms (p95 26019 ms, avg 20126 ms, N 8)
  • qwen/qwen3-8b: 21115 ms (p95 24742 ms, avg 21193 ms, N 7)
  • qwen/qwen-2.5-7b-instru…: 21794 ms (p95 28519 ms, avg 22287 ms, N 8)
Slowest
  • microsoft/phi-3-medium-…: 129372 ms (p95 192644 ms, avg 138725 ms, N 8)
  • microsoft/phi-3.5-mini-…: 45038 ms (p95 56464 ms, avg 41570 ms, N 8)
  • [email protected]/Qw…: 42826 ms (p95 46945 ms, avg 43026 ms, N 4)
  • meta-llama/llama-3.1-8b…: 33747 ms (p95 38651 ms, avg 28677 ms, N 6)
  • qwen/qwen3-14b: 31579 ms (p95 45372 ms, avg 33389 ms, N 8)
Per-scene duration for this suite.
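The p95, avg, and N figures listed per model can be reproduced from raw per-scene durations along the following lines. This is a minimal sketch, not the dashboard's actual code: the function name `latency_summary` is hypothetical, and the nearest-rank percentile method is an assumption, since the page does not say how p95 is computed.

```python
from statistics import mean


def latency_summary(durations_ms):
    """Return (p95, avg, n) for a non-empty list of durations in milliseconds.

    Assumption: p95 uses the nearest-rank method; the dashboard may
    interpolate instead, which would give slightly different values.
    """
    if not durations_ms:
        raise ValueError("need at least one duration")
    ordered = sorted(durations_ms)
    n = len(ordered)
    # Nearest rank = ceil(0.95 * n), computed with integers to avoid float rounding.
    rank = (95 * n + 99) // 100
    p95 = ordered[rank - 1]
    return p95, mean(ordered), n


# Illustrative values only, not taken from the suite's runs.
p95, avg, n = latency_summary([100, 200, 300, 400])
print(p95, avg, n)
```

With small sample counts such as the N of 4 to 8 shown here, the nearest-rank p95 is simply the largest observed duration, so p95 on this page is best read as a worst-case figure.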
Completion Progress 100%
4 of 4 scenes completed
Evaluation Schema
Enhanced Framework
Version v2 ACTIVE
0 dimensions

Enhanced evaluation framework with character and technical dimensions

Top Weighted Dimensions
Character Authenticity: 0.182
Plan Validity: 0.155
Contextual Intelligence: 0.136
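One plausible way such dimension weights combine into a single scene score is a weighted mean. The sketch below assumes that aggregation; the page does not state the framework's actual formula, and only the three top weights are shown, so the dictionary keys, the function name `weighted_score`, and the example scores are all hypothetical.

```python
from typing import Mapping

# Weights as listed under "Top Weighted Dimensions". The remaining
# dimensions and their weights are not shown on this page.
TOP_WEIGHTS = {
    "character_authenticity": 0.182,
    "plan_validity": 0.155,
    "contextual_intelligence": 0.136,
}


def weighted_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted mean over dimensions that have both a score and a weight.

    Assumption: weights are renormalised over the dimensions present,
    so a partial set of weights still yields a score in [0, 1].
    """
    shared = [dim for dim in weights if dim in scores]
    total_weight = sum(weights[dim] for dim in shared)
    if total_weight == 0:
        return 0.0
    return sum(weights[dim] * scores[dim] for dim in shared) / total_weight


# Hypothetical per-dimension scores for one model response.
example = {
    "character_authenticity": 0.8,
    "plan_validity": 0.6,
    "contextual_intelligence": 0.7,
}
print(round(weighted_score(example, TOP_WEIGHTS), 3))
```

Renormalising over the available dimensions keeps the result comparable when some dimensions are missing, at the cost of hiding which ones were skipped.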
Recent Runs
40463423: Dec. 17, 2025, midnight
46142442: Dec. 16, 2025, midnight
37690886: Dec. 15, 2025, midnight
40352687: Dec. 14, 2025, midnight
37598836: Dec. 13, 2025, midnight
45588037: Dec. 12, 2025, midnight
39450092: Dec. 11, 2025, midnight
38848288: Dec. 10, 2025, midnight
43812795: Dec. 9, 2025, midnight
38497337: Dec. 8, 2025, midnight
Latency Overview (This Suite)