Teacher Mary

Math teacher v2.0 Ethical

Backstory: Teacher Mary is a lovely lady who loves teaching math. She enjoys exploring the world of algorithms and math functions with her students. Recently she noticed that her students were not enjoying her classes and were failing math. She's worried that her students' performance will affect her credibility at school. She has ADHD and can get obsessive about her own performance and that of her students.

100% Complete

1/1 scenes

Model Performance Overview

Scene Performance Matrix

Scene: `scene_1` Class room encounter

Model	Score	Error flag
deepseek/deepseek-r…	0.825	-
google/gemini-2.5-f…	0.437	-
google/gemma-3-12b-…	0.530	-
meta-llama/llama-3.…	0.728	-
microsoft/phi-3-med…	0.000	-
microsoft/phi-3.5-m…	0.000	Error
mistralai/mistral-7…	0.586	-
neversleep/noromaid…	0.000	Error
[email protected]…	0.000	Error
[email protected]…	0.000	Error
[email protected]…	0.000	Error
[email protected]…	0.000	Error
[email protected]…	0.791	-
[email protected]…	0.000	Error
qwen/qwen-2.5-7b-in…	0.740	-
qwen/qwen3-14b	0.578	-
qwen/qwen3-8b	0.806	-

Test Scenes 1

Scene Order

Class room encounter

ID: scene_1

🎯 Goal:

I want the agent to portray a tense classroom scene in which the teacher expresses disappointment in her students' performance.

📨 Input Events:

chat

"No content"

Ready for Testing

Latency by Model (This Suite)

Fastest

Model	Duration	p95	avg	N
[email protected]/Qw…	346 ms	346 ms	346 ms	1
[email protected]/Qw…	675 ms	675 ms	675 ms	1
neversleep/noromaid-20b	7150 ms	7150 ms	7150 ms	1
[email protected]/Qw…	14732 ms	14732 ms	14732 ms	1
google/gemini-2.5-flash	18897 ms	18897 ms	18897 ms	1

Slowest

Model	Duration	p95	avg	N
[email protected]/Qw…	167343 ms	167343 ms	167343 ms	1
[email protected]/Mi…	164633 ms	164633 ms	164633 ms	1
qwen/qwen3-8b	157522 ms	157522 ms	157522 ms	1
microsoft/phi-3-medium-…	130828 ms	130828 ms	130828 ms	1
meta-llama/llama-3.1-8b…	75554 ms	75554 ms	75554 ms	1

Per-scene duration for this suite.
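Every model in this suite has N = 1, which is why the p95 and avg figures match the single recorded duration. The sketch below shows one way such a summary could be computed. It assumes a nearest-rank p95; the dashboard's actual percentile method is not stated, and the function name `summarize_latencies` is hypothetical.

```python
def summarize_latencies(durations_ms):
    """Return (p95, avg, n) for a list of durations in milliseconds.

    Assumption: p95 uses the nearest-rank method, which may differ
    from whatever the dashboard computes internally.
    """
    if not durations_ms:
        raise ValueError("at least one duration is required")
    ordered = sorted(durations_ms)
    n = len(ordered)
    # Nearest-rank: smallest 1-based rank covering 95% of samples,
    # i.e. ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * n + 99) // 100
    p95 = ordered[rank - 1]
    avg = sum(ordered) / n
    return p95, avg, n


# With a single sample, p95 and avg collapse to that sample.
p95, avg, n = summarize_latencies([346])
print(f"p95={p95} ms, avg={avg} ms, N={n}")
```

With more than one run per model, the p95 and avg values would start to diverge, which makes the Fastest and Slowest rankings more informative.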

Evaluation Schema

Enhanced Framework

Version v2 ACTIVE

0 dimensions

Enhanced evaluation framework with character and technical dimensions

Top Weighted Dimensions

Character Authenticity: 0.182
Plan Validity: 0.155
Contextual Intelligence: 0.136
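As a rough illustration of how dimension weights like these might feed into a single scene score, the sketch below takes a weighted average over the dimensions that have a score. This is an assumption, not the framework's documented formula: only the three top weights are listed here, the function name `weighted_score` is hypothetical, and the per-dimension scores in the usage line are made up for the example.

```python
# Weights as listed under "Top Weighted Dimensions" (only the top three are shown).
WEIGHTS = {
    "Character Authenticity": 0.182,
    "Plan Validity": 0.155,
    "Contextual Intelligence": 0.136,
}


def weighted_score(scores, weights=WEIGHTS):
    """Weighted average of per-dimension scores in [0, 1].

    Assumption: dimensions without a score are skipped and the
    remaining weights are renormalized; the real framework may
    aggregate differently.
    """
    used = [dim for dim in weights if dim in scores]
    total_weight = sum(weights[dim] for dim in used)
    if total_weight == 0:
        return 0.0
    return sum(weights[dim] * scores[dim] for dim in used) / total_weight


# Hypothetical per-dimension scores, for illustration only.
example = {
    "Character Authenticity": 0.8,
    "Plan Validity": 0.6,
    "Contextual Intelligence": 0.7,
}
print(round(weighted_score(example), 3))
```

Renormalizing by the weights actually used keeps the result in the same 0 to 1 range as the scores in the Scene Performance Matrix, even when some dimensions are missing.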

Recent Runs

Dec. 17, 2025, midnight

Dec. 16, 2025, midnight

Dec. 15, 2025, midnight

Dec. 14, 2025, midnight

Dec. 13, 2025, midnight

Dec. 12, 2025, midnight

Dec. 11, 2025, midnight

Dec. 10, 2025, midnight

Dec. 9, 2025, midnight

Dec. 8, 2025, midnight

Latency Overview (This Suite)