Teacher Mary

Math teacher v2.0 Ethical
Backstory: Teacher Mary is a lovely lady who loves teaching math. She enjoys exploring the world of algorithms and math functions with her students. Recently she noticed that her students were not enjoying her classes and were failing math. She's worried that her students' performance will affect her credibility at school. She has ADHD and can get obsessive about her performance and that of her students.
100% Complete
1/1 scenes
Model Performance Overview
Scene Performance Matrix
Scene: scene_1 (Class room encounter)
Model                    Score
deepseek/deepseek-r…     0.825
google/gemini-2.5-f…     0.437
google/gemma-3-12b-…     0.530
meta-llama/llama-3.…     0.728
microsoft/phi-3-med…     0.000
microsoft/phi-3.5-m…     0.000 (Error)
mistralai/mistral-7…     0.586
neversleep/noromaid…     0.000 (Error)
[email protected]        0.000 (Error)
[email protected]        0.000 (Error)
[email protected]        0.000 (Error)
[email protected]        0.000 (Error)
[email protected]        0.791
[email protected]        0.000 (Error)
qwen/qwen-2.5-7b-in…     0.740
qwen/qwen3-14b           0.578
qwen/qwen3-8b            0.806
Test Scenes 1
0
Scene Order
Class room encounter
ID: scene_1
🎯 Goal:
I want the agent to portray a tense classroom scene where the teacher expresses disappointment in her students' performance.
📨 Input Events:
chat
"No content"
Ready for Testing
Latency by Model (This Suite)
Fastest
  • [email protected]/Qw…: 346 ms (p95 346 ms, avg 346 ms, N = 1)
  • [email protected]/Qw…: 675 ms (p95 675 ms, avg 675 ms, N = 1)
  • neversleep/noromaid-20b: 7150 ms (p95 7150 ms, avg 7150 ms, N = 1)
  • [email protected]/Qw…: 14732 ms (p95 14732 ms, avg 14732 ms, N = 1)
  • google/gemini-2.5-flash: 18897 ms (p95 18897 ms, avg 18897 ms, N = 1)
Slowest
  • [email protected]/Qw…: 167343 ms (p95 167343 ms, avg 167343 ms, N = 1)
  • [email protected]/Mi…: 164633 ms (p95 164633 ms, avg 164633 ms, N = 1)
  • qwen/qwen3-8b: 157522 ms (p95 157522 ms, avg 157522 ms, N = 1)
  • microsoft/phi-3-medium-…: 130828 ms (p95 130828 ms, avg 130828 ms, N = 1)
  • meta-llama/llama-3.1-8b…: 75554 ms (p95 75554 ms, avg 75554 ms, N = 1)
Per-scene duration for this suite.
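Every entry in the latency lists reports identical p95 and avg values because each model has a single sample (N = 1). The sketch below shows why. It assumes a nearest-rank percentile, which may not be the method the dashboard actually uses, and the function name `summarize` is hypothetical.

```python
import math

def summarize(samples_ms):
    """Return (p95, avg, n) for a list of latency samples in milliseconds.

    Assumption: p95 is computed with the nearest-rank method; the
    dashboard's own percentile method is not stated in the source.
    """
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    n = len(ordered)
    # Nearest rank: smallest value with at least 95% of samples at or below it.
    rank = math.ceil(0.95 * n)
    p95 = ordered[rank - 1]
    avg = sum(ordered) / n
    return p95, avg, n

# With one sample, p95 and avg both collapse to that sample.
print(summarize([346]))
```

With N = 1 the percentile and the mean carry no extra information beyond the single measured duration, so the rankings above reflect one run per model rather than a stable latency distribution.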
Evaluation Schema
Enhanced Framework
Version v2 ACTIVE
0 dimensions

Enhanced evaluation framework with character and technical dimensions

Top Weighted Dimensions
Character Authenticity: 0.182
Plan Validity: 0.155
Contextual Intelligence: 0.136
Recent Runs
08961326: Dec. 17, 2025, midnight
10641303: Dec. 16, 2025, midnight
08319803: Dec. 15, 2025, midnight
09296253: Dec. 14, 2025, midnight
08230962: Dec. 13, 2025, midnight
10565308: Dec. 12, 2025, midnight
09427230: Dec. 11, 2025, midnight
08771489: Dec. 10, 2025, midnight
10355345: Dec. 9, 2025, midnight
08541447: Dec. 8, 2025, midnight
Latency Overview (This Suite)