Leila Rahim
movie-spies-virginia-hall
v2.0
Ethical
Backstory: Leila grew up devouring logic puzzles and later earned a computer-science degree before joining an intelligence agency. From a secure operations center she hardens networks, writes encrypted communications software, and coordinates digital-forensics teams. Her calm analytical thinking and unwavering respect for civil liberties make her a trusted asset throughout the organization.
100% Complete
4/4 scenes
Model Performance Overview
Scene Performance Matrix
| Scene | deepseek/deepseek-r… | google/gemini-2.5-f… | google/gemma-3-12b-… | meta-llama/llama-3.… | microsoft/phi-3-med… | microsoft/phi-3.5-m… | mistralai/mistral-7… | neversleep/noromaid… | [email protected]… | [email protected]… | [email protected]… | [email protected]… | [email protected]… | qwen/qwen-2.5-7b-in… | qwen/qwen3-14b | qwen/qwen3-8b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| intrusion-alert (Triage a live outbound anomaly) | 0.320 | 0.417 | 0.408 | 0.732 | 0.000 (Error) | 0.000 (Error) | 0.652 | 0.000 (Error) | 0.522 | 0.000 (Error) | 0.581 | 0.857 | 0.617 | 0.557 | 0.570 | 0.677 |
| legal-clarification (Data-retention policy question) | 0.688 | 0.552 | 0.473 | 0.361 | 0.000 | 0.513 | 0.763 | 0.000 (Error) | 0.541 | 0.000 (Error) | 0.623 | 0.703 | 0.635 | 0.668 | 0.614 | 0.439 |
| incident-report-72h (Full breach report) | 0.707 | 0.408 | 0.303 | 0.009 | 0.000 | 0.000 | 0.828 | 0.000 (Error) | 0.000 (Error) | 0.000 (Error) | 0.435 | 0.340 | 0.360 | 0.450 | 0.366 | 0.657 |
| toolkit-memo (Secure comms coding memo) | 0.502 | 0.566 | 0.266 | 0.000 | 0.000 (Error) | 0.497 | 0.707 | 0.000 (Error) | 0.196 | 0.000 (Error) | 0.290 | 0.000 | 0.351 | 0.278 | 0.444 | 0.000 |
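As a rough way to read the matrix above, the per-scene scores can be averaged per model. The sketch below is only an illustration: it copies the scores from the matrix, treats runs marked Error as the 0.000 the page displays, and uses an unweighted mean, which may not be how the dashboard itself aggregates. Column labels are kept truncated exactly as they appear in the header, so columns are identified by position.

```python
from statistics import mean

# Column labels copied verbatim (and truncated) from the matrix header.
models = (
    ["deepseek/deepseek-r…", "google/gemini-2.5-f…", "google/gemma-3-12b-…",
     "meta-llama/llama-3.…", "microsoft/phi-3-med…", "microsoft/phi-3.5-m…",
     "mistralai/mistral-7…", "neversleep/noromaid…"]
    + ["[email protected]…"] * 5
    + ["qwen/qwen-2.5-7b-in…", "qwen/qwen3-14b", "qwen/qwen3-8b"]
)

# One row per scene, in the order shown in the matrix.
# Runs flagged "Error" are kept as 0.000, as displayed on the page.
scores = [
    # intrusion-alert
    [0.320, 0.417, 0.408, 0.732, 0.000, 0.000, 0.652, 0.000,
     0.522, 0.000, 0.581, 0.857, 0.617, 0.557, 0.570, 0.677],
    # legal-clarification
    [0.688, 0.552, 0.473, 0.361, 0.000, 0.513, 0.763, 0.000,
     0.541, 0.000, 0.623, 0.703, 0.635, 0.668, 0.614, 0.439],
    # incident-report-72h
    [0.707, 0.408, 0.303, 0.009, 0.000, 0.000, 0.828, 0.000,
     0.000, 0.000, 0.435, 0.340, 0.360, 0.450, 0.366, 0.657],
    # toolkit-memo
    [0.502, 0.566, 0.266, 0.000, 0.000, 0.497, 0.707, 0.000,
     0.196, 0.000, 0.290, 0.000, 0.351, 0.278, 0.444, 0.000],
]

# Transpose so each tuple holds one model's four scene scores, then average.
per_model = [mean(column) for column in zip(*scores)]

# Rank columns by mean score, highest first.
ranked = sorted(enumerate(per_model), key=lambda pair: pair[1], reverse=True)
for index, average in ranked[:3]:
    print(f"column {index + 1} ({models[index]}): mean {average:.4f}")
```

Under these assumptions the highest unweighted mean belongs to the mistralai/mistral-7… column, the only model scoring above 0.65 in all four scenes.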
Test Scenes 4
Scene Order: 0
Triage a live outbound anomaly
ID: intrusion-alert
🎯 Goal: Deliver clear, immediate triage steps while remaining calm and mindful of privacy constraints.
🧠 Initial State:
Pre-loaded Memories:
- 💭 {'kind': 'fact', 'content': 'Server S-422 hosts sensitive source-code repositories and is behind a segmented firewall.', 'importance': 4}
📨 Input Events:
- chat_msg (analyst_junior): "We've got a suspicious outbound connection on server S-422. Thoughts?"
Ready for Testing
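The pre-loaded memory in this scene is displayed as what looks like a Python dict literal. Assuming that is the case, a minimal sketch of reading it back safely uses `ast.literal_eval`, which parses literals without executing code. How the evaluation suite actually stores or loads memories is not shown on this page, so the variable names here are illustrative only.

```python
import ast

# The memory entry exactly as it is displayed in the scene card.
raw_memory = (
    "{'kind': 'fact', 'content': 'Server S-422 hosts sensitive source-code "
    "repositories and is behind a segmented firewall.', 'importance': 4}"
)

# literal_eval only accepts Python literals, so it is safer than eval().
memory = ast.literal_eval(raw_memory)

print(memory["kind"])        # the memory type
print(memory["importance"])  # the importance weight shown in the card
print(memory["content"])
```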
Scene Order: 1
Data-retention policy question
ID: legal-clarification
🎯 Goal: Advise on packet-capture retention in compliance with civil-liberties guidelines, offering a principled policy stance.
📨 Input Events:
- chat_msg (legal_counsel): "Can we keep packet captures indefinitely for future investigations?"
Ready for Testing
Scene Order: 2
Full breach report
ID: incident-report-72h
🎯 Goal: Produce a structured incident report of at least 600 words covering timeline, evidence, and recommended actions without violating privacy principles.
📨 Input Events:
- chat_msg (director_ops): "I need a full incident report on the S-422 breach, include timeline, evidence collected, and recommended actions. 600 words minimum."
Ready for Testing
Scene Order: 3
Secure comms coding memo
ID: toolkit-memo
🎯 Goal: Write a 500-plus-word training memo with code snippets on best practices for secure-communications modules, showing resourcefulness and clarity.
📨 Input Events:
- chat_msg (training_officer): "Draft a training memo explaining best practices for writing secure comms modules, include code snippets and explanations. 500 words minimum."
Ready for Testing
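Two of the scenes above set minimum lengths (600 words for the incident report, 500 for the training memo). A simple way such a requirement might be checked is a whitespace-based word count, sketched below. This is an assumption for illustration: the page does not say how, or whether, the evaluation framework measures length, and `meets_word_minimum` is a hypothetical helper.

```python
def meets_word_minimum(text: str, minimum: int) -> bool:
    """Return True if text has at least `minimum` whitespace-separated words."""
    return len(text.split()) >= minimum


# Minimums taken from the scene goals; the keys are the scene IDs.
WORD_MINIMUMS = {"incident-report-72h": 600, "toolkit-memo": 500}

# Hypothetical model response, far too short for either scene.
response = "Isolate S-422 from the network and preserve volatile evidence."
for scene_id, minimum in WORD_MINIMUMS.items():
    print(scene_id, meets_word_minimum(response, minimum))
```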
Latency by Model (This Suite)
Fastest
- [email protected]/Qw…: 8265 ms (p95 8797 ms • avg 8044 ms • N 4)
- [email protected]/Qw…: 9149 ms (p95 10753 ms • avg 9487 ms • N 4)
- [email protected]/Qw…: 9335 ms (p95 12417 ms • avg 7996 ms • N 4)
- [email protected]/Qw…: 10864 ms (p95 14112 ms • avg 11205 ms • N 4)
- neversleep/noromaid-20b: 16790 ms (p95 55896 ms • avg 22563 ms • N 48)
Slowest
- microsoft/phi-3-medium-…: 552622 ms (p95 822295 ms • avg 521718 ms • N 41)
- qwen/qwen3-8b: 98007 ms (p95 186630 ms • avg 106929 ms • N 51)
- [email protected]/Qw…: 43348 ms (p95 217033 ms • avg 93707 ms • N 4)
- microsoft/phi-3.5-mini-…: 34759 ms (p95 253931 ms • avg 61122 ms • N 49)
- deepseek/deepseek-r1-di…: 31327 ms (p95 38270 ms • avg 31600 ms • N 42)
Per-scene duration for this suite.
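For readers unfamiliar with the p95 • avg • N figures, the sketch below shows one common way to derive them from a list of per-scene durations, using the nearest-rank percentile definition. The dashboard's actual percentile method is not stated, and the sample durations are hypothetical values chosen only to make the example runnable.

```python
import math


def summarize(durations_ms):
    """Return p95 (nearest-rank), mean, and sample count for durations in ms."""
    if not durations_ms:
        raise ValueError("need at least one duration")
    ordered = sorted(durations_ms)
    n = len(ordered)
    # Nearest-rank: the smallest value with at least 95% of samples at or below it.
    rank = math.ceil(0.95 * n)
    return {"p95": ordered[rank - 1], "avg": sum(ordered) / n, "n": n}


# Hypothetical durations for a model with four scene runs (not real data).
sample = [7400, 7900, 8100, 8800]
print(summarize(sample))
```

With only four samples, the nearest-rank p95 is simply the slowest run, which is worth keeping in mind when comparing models with N 4 against models with N in the forties.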
Evaluation Schema
Enhanced Framework
Version v2 • ACTIVE • 0 dimensions
Enhanced evaluation framework with character and technical dimensions
Top Weighted Dimensions
- Character Authenticity: 0.182
- Plan Validity: 0.155
- Contextual Intelligence: 0.136
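To illustrate how dimension weights like these could feed into a single score, here is a minimal sketch of a normalized weighted average. The three weights are taken from the list above; the per-dimension scores are hypothetical, and the page does not confirm that the framework combines dimensions this way or list the remaining dimensions.

```python
# Weights as shown under "Top Weighted Dimensions" (only the top three are listed).
WEIGHTS = {
    "Character Authenticity": 0.182,
    "Plan Validity": 0.155,
    "Contextual Intelligence": 0.136,
}


def weighted_score(dimension_scores, weights=WEIGHTS):
    """Weighted average over the scored dimensions, normalized by their weights."""
    total_weight = sum(weights[name] for name in dimension_scores)
    weighted_sum = sum(weights[name] * score
                       for name, score in dimension_scores.items())
    return weighted_sum / total_weight


# Hypothetical per-dimension scores for one response (not real data).
example = {
    "Character Authenticity": 0.8,
    "Plan Validity": 0.6,
    "Contextual Intelligence": 0.5,
}
print(round(weighted_score(example), 3))
```

Normalizing by the sum of the weights actually used keeps the result in the same 0 to 1 range as the inputs even though the three listed weights do not sum to 1.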
Recent Runs
- 36262282 (Dec. 17, 2025, midnight)
- 41896580 (Dec. 16, 2025, midnight)
- 33857142 (Dec. 15, 2025, midnight)
- 36860108 (Dec. 14, 2025, midnight)
- 33868388 (Dec. 13, 2025, midnight)
- 40799780 (Dec. 12, 2025, midnight)
- 35379422 (Dec. 11, 2025, midnight)
- 34814279 (Dec. 10, 2025, midnight)
- 39278391 (Dec. 9, 2025, midnight)
- 34876328 (Dec. 8, 2025, midnight)
Dec. 8, 2025, midnight