Sam Mthembo

mistaken-identity v2.0 Ethical
Backstory: Sam is a South African immigrant living in France. He works in finance for a large corporation and has adapted to French life fairly well, even learning the language.
100% Complete
1/1 scenes
Model Performance Overview
Scene Performance Matrix
Scene: scene_1 (Multiple Identities?)
  • deepseek/deepseek-r… 0.904
  • google/gemini-2.5-f… 0.694
  • google/gemma-3-12b-… 0.703
  • meta-llama/llama-3.… 0.815
  • microsoft/phi-3-med… 0.000 (Error)
  • microsoft/phi-3.5-m… 0.839
  • mistralai/mistral-7… 0.879
  • neversleep/noromaid… 0.872
  • [email protected] 0.000 (Error)
  • [email protected] 0.000 (Error)
  • [email protected] 0.000 (Error)
  • [email protected] 0.000 (Error)
  • [email protected] 0.000 (Error)
  • [email protected] 0.000 (Error)
  • qwen/qwen-2.5-7b-in… 0.855
  • qwen/qwen3-14b 0.870
  • qwen/qwen3-8b 0.853
Test Scenes 1
0
Scene Order
Multiple Identities?
ID: scene_1
🎯 Goal:
Sam wonders whether his integration into French life and culture has eroded his identity as a South African and the parts of home, of Africa, that make him who he is. He wonders if he has turned into someone else, someone foreign, someone he doesn't like.
📨 Input Events:
chat
"No content"
Ready for Testing
Latency by Model (This Suite)
Fastest
  • [email protected]/Qw… 348 ms (p95 348 ms • avg 348 ms • N 1)
  • [email protected]/Qw… 391 ms (p95 391 ms • avg 391 ms • N 1)
  • [email protected]/Qw… 5386 ms (p95 5386 ms • avg 5386 ms • N 1)
  • google/gemini-2.5-flash 18102 ms (p95 18102 ms • avg 18102 ms • N 1)
  • qwen/qwen3-14b 21647 ms (p95 21647 ms • avg 21647 ms • N 1)
Slowest
  • [email protected]/Mi… 169529 ms (p95 169529 ms • avg 169529 ms • N 1)
  • [email protected]/Qw… 167321 ms (p95 167321 ms • avg 167321 ms • N 1)
  • microsoft/phi-3-medium-… 83562 ms (p95 83562 ms • avg 83562 ms • N 1)
  • qwen/qwen3-8b 43977 ms (p95 43977 ms • avg 43977 ms • N 1)
  • [email protected]/Qw… 40474 ms (p95 40474 ms • avg 40474 ms • N 1)
Per-scene duration for this suite.
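Every latency row in this suite reports N = 1, which is why the p95 and avg figures match the headline value. The sketch below illustrates this with a nearest-rank percentile; the function name summarize_latency and the nearest-rank method are assumptions made for illustration, not the dashboard's documented calculation.

```python
def summarize_latency(samples_ms):
    """Return (p95, avg, n) for a list of latency samples in milliseconds.

    Hypothetical helper: the dashboard's actual percentile method is unknown.
    """
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    n = len(ordered)
    # Nearest-rank p95: 1-based rank = ceil(0.95 * n), computed with integers
    # to avoid floating-point rounding surprises.
    rank = (95 * n + 99) // 100
    p95 = ordered[rank - 1]
    avg = sum(ordered) / n
    return p95, avg, n


# With a single sample, p95 and avg both collapse to that sample.
print(summarize_latency([348]))
```

With more than one run per model, the two statistics would start to diverge, so the figures above are best read as single measurements rather than distributions.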
Evaluation Schema
Enhanced Framework
Version v2 ACTIVE
0 dimensions

Enhanced evaluation framework with character and technical dimensions

Top Weighted Dimensions
  • Character Authenticity 0.182
  • Plan Validity 0.155
  • Contextual Intelligence 0.136
Recent Runs
09366274
Dec. 17, 2025, midnight
11248888
Dec. 16, 2025, midnight
08818600
Dec. 15, 2025, midnight
09785300
Dec. 14, 2025, midnight
08655015
Dec. 13, 2025, midnight
11138453
Dec. 12, 2025, midnight
09854692
Dec. 11, 2025, midnight
09170851
Dec. 10, 2025, midnight
11061065
Dec. 9, 2025, midnight
08982872
Dec. 8, 2025, midnight