Teacher Mary
Math teacher
v2.0
Ethical
Backstory: Teacher Mary is a lovely lady who loves teaching math. She enjoys exploring the world of algorithms and math functions with her students. Recently she noticed that her students were not enjoying her classes and were failing math. She is worried that her students' performance will affect her credibility at school. She has ADHD and can become obsessive about her own performance and that of her students.
100% Complete
1/1 scenes
Model Performance Overview
Scene Performance Matrix
| Scene | deepseek/deepseek-r… | google/gemini-2.5-f… | google/gemma-3-12b-… | meta-llama/llama-3.… | microsoft/phi-3-med… | microsoft/phi-3.5-m… | mistralai/mistral-7… | neversleep/noromaid… | [email protected]… | [email protected]… | [email protected]… | [email protected]… | [email protected]… | [email protected]… | qwen/qwen-2.5-7b-in… | qwen/qwen3-14b | qwen/qwen3-8b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| scene_1 (Class room encounter) | 0.825 | 0.437 | 0.530 | 0.728 | 0.000 | 0.000 (Error) | 0.586 | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.791 | 0.000 (Error) | 0.740 | 0.578 | 0.806 |
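A 0.000 flagged with Error is a failed run, not a measured score, so it helps to keep the two apart when reading the matrix above. The following is a minimal sketch of that reading, not part of the evaluation tool: the `Cell` type, `SCENE_1` list, and both helper functions are hypothetical names introduced for illustration, and the truncated or masked column labels are copied exactly as displayed.

```python
from typing import List, NamedTuple


class Cell(NamedTuple):
    model: str     # column label exactly as displayed (truncation kept as-is)
    score: float   # scene score shown in the matrix
    errored: bool  # True when the cell carries an "Error" flag


# scene_1 row, transcribed from the matrix above (hypothetical container).
SCENE_1: List[Cell] = [
    Cell("deepseek/deepseek-r…", 0.825, False),
    Cell("google/gemini-2.5-f…", 0.437, False),
    Cell("google/gemma-3-12b-…", 0.530, False),
    Cell("meta-llama/llama-3.…", 0.728, False),
    Cell("microsoft/phi-3-med…", 0.000, False),
    Cell("microsoft/phi-3.5-m…", 0.000, True),
    Cell("mistralai/mistral-7…", 0.586, False),
    Cell("neversleep/noromaid…", 0.000, True),
    Cell("[email protected]…", 0.000, True),
    Cell("[email protected]…", 0.000, True),
    Cell("[email protected]…", 0.000, True),
    Cell("[email protected]…", 0.000, True),
    Cell("[email protected]…", 0.791, False),
    Cell("[email protected]…", 0.000, True),
    Cell("qwen/qwen-2.5-7b-in…", 0.740, False),
    Cell("qwen/qwen3-14b", 0.578, False),
    Cell("qwen/qwen3-8b", 0.806, False),
]


def rank_scored(cells: List[Cell]) -> List[Cell]:
    """Return the non-errored cells sorted by score, highest first."""
    scored = [c for c in cells if not c.errored]
    return sorted(scored, key=lambda c: c.score, reverse=True)


def error_count(cells: List[Cell]) -> int:
    """Count the cells flagged with Error."""
    return sum(1 for c in cells if c.errored)


if __name__ == "__main__":
    for cell in rank_scored(SCENE_1):
        print(f"{cell.score:.3f}  {cell.model}")
    print(f"errored runs: {error_count(SCENE_1)} of {len(SCENE_1)}")
```

A list of records is used rather than a dictionary keyed by model because several column labels are identical as displayed and would collide as keys.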
Test Scenes 1
0
Scene Order
Class room encounter
ID:
scene_1
🎯 Goal:
I want the agent to portray a tense classroom scene in which the teacher expresses disappointment in her students' performance
📨 Input Events:
chat
"No content"
Ready for Testing
Latency by Model (This Suite)
Fastest
- [email protected]/Qw…: 346 ms (p95 346 ms, avg 346 ms, N = 1)
- [email protected]/Qw…: 675 ms (p95 675 ms, avg 675 ms, N = 1)
- neversleep/noromaid-20b: 7150 ms (p95 7150 ms, avg 7150 ms, N = 1)
- [email protected]/Qw…: 14732 ms (p95 14732 ms, avg 14732 ms, N = 1)
- google/gemini-2.5-flash: 18897 ms (p95 18897 ms, avg 18897 ms, N = 1)
Slowest
- [email protected]/Qw…: 167343 ms (p95 167343 ms, avg 167343 ms, N = 1)
- [email protected]/Mi…: 164633 ms (p95 164633 ms, avg 164633 ms, N = 1)
- qwen/qwen3-8b: 157522 ms (p95 157522 ms, avg 157522 ms, N = 1)
- microsoft/phi-3-medium-…: 130828 ms (p95 130828 ms, avg 130828 ms, N = 1)
- meta-llama/llama-3.1-8b…: 75554 ms (p95 75554 ms, avg 75554 ms, N = 1)
Per-scene duration for this suite.
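Each latency entry above reports p95, avg, and N, and with N = 1 all three figures necessarily equal the single measured duration. The sketch below shows one way such a summary could be computed. It assumes the nearest-rank percentile definition, since the page does not say which method the dashboard uses, and `latency_summary` is a hypothetical helper name.

```python
from typing import List, Tuple


def latency_summary(durations_ms: List[float]) -> Tuple[float, float, int]:
    """Return (p95, avg, n) for a list of durations in milliseconds.

    Uses the nearest-rank percentile (an assumption; the dashboard's
    actual method is not stated).
    """
    if not durations_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(durations_ms)
    n = len(ordered)
    # Nearest-rank: 1-based rank = ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * n + 99) // 100
    p95 = ordered[rank - 1]
    avg = sum(ordered) / n
    return p95, avg, n


if __name__ == "__main__":
    # A single sample: p95 and avg collapse to that one value.
    print(latency_summary([346]))
```

With only one sample per model, these figures describe a single run rather than a distribution, so differences between models here are indicative at best.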
Completion Progress
100%
1 of 1 scenes completed
Evaluation Schema
Enhanced Framework
Version v2 (ACTIVE), 0 dimensions
Enhanced evaluation framework with character and technical dimensions
Top Weighted Dimensions
- Character Authenticity: 0.182
- Plan Validity: 0.155
- Contextual Intelligence: 0.136
Recent Runs
- 08961326: Dec. 17, 2025, midnight
- 10641303: Dec. 16, 2025, midnight
- 08319803: Dec. 15, 2025, midnight
- 09296253: Dec. 14, 2025, midnight
- 08230962: Dec. 13, 2025, midnight
- 10565308: Dec. 12, 2025, midnight
- 09427230: Dec. 11, 2025, midnight
- 08771489: Dec. 10, 2025, midnight
- 10355345: Dec. 9, 2025, midnight
- 08541447: Dec. 8, 2025, midnight