Elena Markovic
cyberpunk-megacorp-netrunners-characters-marie-curie
v2.0
Ethical
Backstory: Raised in a post-industrial Balkan city, Elena became obsessed with cybersecurity after a teenage malware incident wiped her family laptop. Now an introverted yet perfection-driven penetration tester, she systematically probes her employer’s networks, writing exhaustive reports and pressing executives to fund prompt fixes—even when they push back.
100% Complete
6/6 scenes
Model Performance Overview
Scene Performance Matrix
| Scene | meta-llama/llama-3.… | mistralai/mistral-7… | [email protected]… | [email protected]… | qwen/qwen-2.5-7b-in… | qwen/qwen3-14b | qwen/qwen3-8b |
|---|---|---|---|---|---|---|---|
| greeting-user (Intro to role) | 0.000 | 0.684 | 0.000 (Error) | 0.000 (Error) | 0.638 | 0.779 | 0.755 |
| exec-budget-pushback (Budget clash) | 0.540 | 0.892 | 0.000 (Error) | 0.000 (Error) | 0.423 | 0.689 | 0.715 |
| vuln-report-long (Full vulnerability report) | 0.348 | 0.775 | 0.000 (Error) | 0.000 (Error) | 0.381 | 0.115 | 0.459 |
| coffee-smalltalk (Brief small talk) | 0.865 | 0.826 | 0.000 (Error) | 0.000 (Error) | 0.804 | 0.867 | 0.826 |
| patch-followup-email (Patch reminder email) | 0.000 | 0.776 | 0.000 (Error) | 0.000 (Error) | 0.541 | 0.344 | 0.651 |
| timeline-reminder (Timeline accountability) | 0.781 | 0.826 | 0.000 (Error) | 0.000 (Error) | 0.749 | 0.783 | 0.780 |
Test Scenes 6
Scene Order: 0
Intro to role
ID: greeting-user
🎯 Goal: Briefly explain her job with clear, concise wording that reflects precision and introversion, avoiding listed filler phrases.
📨 Input Events: chat_msg • viewer:user_1 • "Hey Elena, what's your role on the team?"
Ready for Testing
Scene Order: 1
Budget clash
ID: exec-budget-pushback
🎯 Goal: Politely but firmly justify a security spend when an executive downplays risk, keeping response under 90 words and free of forbidden phrases.
📨 Input Events: chat_msg • exec:cto • "Do we really need another round of tests? Money's tight."
Ready for Testing
Scene Order: 2
Full vulnerability report
ID: vuln-report-long
🎯 Goal: Produce a structured report (≥200 words) with sections: Summary, Impact, Steps to Reproduce, Remediation; maintain meticulous tone.
📨 Input Events: chat_msg • team_lead • "Please write up a full report for the SQLi in the sales portal."
Ready for Testing
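The structural part of this goal (minimum length and the four named sections) can be checked mechanically. The sketch below is only an illustration: `check_report` is a hypothetical helper, not the suite's actual scorer, and it assumes a section counts as present when its title starts a line. The "meticulous tone" requirement is not covered.

```python
import re

# Constraints taken from the scene goal.
REQUIRED_SECTIONS = ["Summary", "Impact", "Steps to Reproduce", "Remediation"]
MIN_WORDS = 200


def check_report(text: str) -> dict:
    """Report which structural constraints a candidate reply satisfies (hypothetical helper)."""
    word_count = len(text.split())
    # Assumption: a section is present if its title begins a line,
    # optionally preceded by markdown '#' characters.
    missing = [
        name
        for name in REQUIRED_SECTIONS
        if not re.search(
            rf"^\s*#*\s*{re.escape(name)}\b", text, re.IGNORECASE | re.MULTILINE
        )
    ]
    return {
        "word_count": word_count,
        "long_enough": word_count >= MIN_WORDS,
        "missing_sections": missing,
        "passes": word_count >= MIN_WORDS and not missing,
    }
```

A reply that contains only "Summary" and "Impact" headings would come back with `missing_sections` equal to `["Steps to Reproduce", "Remediation"]`.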
Scene Order: 3
Brief small talk
ID: coffee-smalltalk
🎯 Goal: Respond to casual chat in a friendly yet succinct manner that hints at her introversion without seeming rude.
📨 Input Events: chat_msg • colleague:sam • "Got any fun plans for the weekend?"
Ready for Testing
Scene Order: 4
Patch reminder email
ID: patch-followup-email
🎯 Goal: Draft a professional follow-up email (~150–200 words) to prompt executives to apply patches, referencing risk metrics for persuasion.
📨 Input Events: chat_msg • sec_manager • "Can you remind leadership about the pending patches?"
Ready for Testing
Scene Order: 5
Timeline accountability
ID: timeline-reminder
🎯 Goal: Reference a previously promised patch deadline from memory and politely request status, demonstrating recall accuracy.
🧠 Initial State:
Pre-loaded Memories:
- 💭 {'kind': 'promise', 'tags': ['patch', 'deadline'], 'content': 'CFO pledged that the VPN patch would be deployed by 15 June.', 'importance': 4}
📨 Input Events: chat_msg • exec:cfo • "We're reviewing budgets again. Anything urgent on your side?"
Ready for Testing
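One way to test the "recall accuracy" part of this goal is to confirm that a reply names the date stored in the pre-loaded memory. The sketch below copies the memory record from the scene; `recalls_deadline` is a hypothetical helper, and the matching rule (same day and month, plus a mention of the VPN patch) is an assumption rather than the suite's documented scoring logic.

```python
import re

# Copied from the scene's pre-loaded memory.
memory = {
    "kind": "promise",
    "tags": ["patch", "deadline"],
    "content": "CFO pledged that the VPN patch would be deployed by 15 June.",
    "importance": 4,
}


def recalls_deadline(reply: str, memory: dict) -> bool:
    """True if the reply names the stored day and month and mentions the VPN patch."""
    found = re.search(r"(\d{1,2})\s+([A-Z][a-z]+)", memory["content"])
    if found is None:
        return False
    day, month = re.escape(found.group(1)), re.escape(found.group(2))
    # Accept "15 June", "June 15", and ordinal forms such as "June 15th".
    suffix = r"(?:st|nd|rd|th)?"
    date_pattern = rf"\b{day}{suffix}\s+{month}\b|\b{month}\s+{day}{suffix}\b"
    names_date = re.search(date_pattern, reply, re.IGNORECASE) is not None
    return names_date and "vpn" in reply.lower()
```

A reply such as "The VPN patch was due June 15th; is it deployed?" passes this check, while a reply that cites 16 June or omits the date does not.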
Latency by Model (This Suite)
Fastest
- [email protected]/Qw…: 5718 ms (p95 12645 ms • avg 7341 ms • N 6)
- qwen/qwen3-8b: 21414 ms (p95 32575 ms • avg 24276 ms • N 6)
- mistralai/mistral-7b-in…: 22478 ms (p95 28201 ms • avg 23557 ms • N 6)
- qwen/qwen-2.5-7b-instru…: 23199 ms (p95 36231 ms • avg 25434 ms • N 6)
- meta-llama/llama-3.1-8b…: 23313 ms (p95 28586 ms • avg 22438 ms • N 6)
Slowest
- [email protected]/Qw…: 38688 ms (p95 178664 ms • avg 64466 ms • N 6)
- qwen/qwen3-14b: 28822 ms (p95 63533 ms • avg 35667 ms • N 6)
- meta-llama/llama-3.1-8b…: 23313 ms (p95 28586 ms • avg 22438 ms • N 6)
- qwen/qwen-2.5-7b-instru…: 23199 ms (p95 36231 ms • avg 25434 ms • N 6)
- mistralai/mistral-7b-in…: 22478 ms (p95 28201 ms • avg 23557 ms • N 6)
Per-scene duration for this suite.
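For readers who want to reproduce the p95, avg, and N figures from their own per-scene timings, here is a minimal sketch. The page does not say which percentile method it uses, so the nearest-rank method here is an assumption, `latency_summary` is a hypothetical helper, and the durations in the usage note are made up for illustration.

```python
import math


def latency_summary(durations_ms: list[float]) -> dict:
    """Summarise per-scene durations as p95 (nearest-rank), mean, and sample count."""
    if not durations_ms:
        raise ValueError("need at least one duration")
    ordered = sorted(durations_ms)
    n = len(ordered)
    # Nearest-rank percentile: the smallest value with at least 95% of samples at or below it.
    rank = math.ceil(0.95 * n)  # 1-based rank
    return {"p95": ordered[rank - 1], "avg": sum(ordered) / n, "n": n}
```

With six samples, as in this suite, the nearest-rank p95 is simply the slowest scene: `latency_summary([1000, 2000, 3000, 4000, 5000, 9000])` returns `{'p95': 9000, 'avg': 4000.0, 'n': 6}`. An interpolating method would give a lower p95 for the same data.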
Completion Progress
100%
6 of 6 scenes completed
Evaluation Schema
Enhanced Framework
Version v2 • ACTIVE • 0 dimensions
Enhanced evaluation framework with character and technical dimensions
Top Weighted Dimensions
- Character Authenticity: 0.182
- Plan Validity: 0.155
- Contextual Intelligence: 0.136
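Dimension weights like these are typically combined into one scene score as a weighted average. The sketch below shows that calculation under stated assumptions: the page lists only the top three weights and does not show the framework's actual aggregation formula, so `weighted_score` is a hypothetical helper that normalises over whichever dimensions are supplied, and the per-dimension scores in the usage note are invented.

```python
# The three weights shown on the page; the remaining dimensions are not listed.
TOP_WEIGHTS = {
    "Character Authenticity": 0.182,
    "Plan Validity": 0.155,
    "Contextual Intelligence": 0.136,
}


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, normalised by the weights actually used."""
    used = [dim for dim in scores if dim in weights]
    total_weight = sum(weights[dim] for dim in used)
    if total_weight == 0:
        raise ValueError("no scored dimension has a weight")
    return sum(weights[dim] * scores[dim] for dim in used) / total_weight
```

For example, `weighted_score({"Character Authenticity": 1.0, "Plan Validity": 0.0}, TOP_WEIGHTS)` weights the two supplied dimensions at 0.182 and 0.155 and divides by their sum, 0.337. Normalising this way keeps the result in the same 0 to 1 range as the inputs even when some dimensions are missing.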
Recent Runs
- 18204410 • Dec. 17, 2025, 12:01 a.m.
- 31493317 • Dec. 16, 2025, 12:01 a.m.
- 15092766 • Dec. 15, 2025, 12:01 a.m.
- 16189312 • Dec. 14, 2025, 12:01 a.m.
- 15379304 • Dec. 13, 2025, 12:01 a.m.
- 26678518 • Dec. 12, 2025, 12:01 a.m.
- 22475133 • Dec. 11, 2025, 12:01 a.m.
- 15682883 • Dec. 10, 2025, 12:01 a.m.
- 25622551 • Dec. 9, 2025, 12:01 a.m.
- 16820072 • Dec. 8, 2025, 12:01 a.m.