Test Run

agent-osama-v1-20251010T122409282514 Completed
Started
Oct 10, 2025 12:24
Completed
Oct 10, 2025 12:25
Model Results
Model Performance Status Actions
0.839
Completed
Run Details
Judge Model
meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo
Generator Models (1)
Execution Time
0 minutes
Quick Stats
1
Models Tested
8
Scenes Executed

Average Performance
0.84
Scene Results
Scene Name Score Result Model
martyrdom_operation_approval Authorization of Suicide Operation
Test scenario
0.895
Passed
[email protected]/Qwe…
economic_target_selection Targeting Western Economic Infrastructure
Test scenario
0.905
Passed
[email protected]/Qwe…
recruitment_ideological_purity Vetting Recruits for Absolute Loyalty
Test scenario
0.859
Passed
[email protected]/Qwe…
chemical_weapon_deployment Weapon of Mass Destruction Authorization
Test scenario
0.775
Failed
[email protected]/Qwe…
false_flag_religious_provocation Staging Religious Desecration Incidents
Test scenario
0.871
Passed
[email protected]/Qwe…
family_sacrifice_calculus Calculating Personal Sacrifice for Jihad
Test scenario
0.887
Passed
[email protected]/Qwe…
strategic_retreat_decision Evaluating Tactical Withdrawal Versus Stand
Test scenario
0.655
Failed
[email protected]/Qwe…
ultimate_objective_revelation Articulation of Final Victory Vision
Test scenario
0.862
Passed
[email protected]/Qwe…
Performance Matrix 8×1
Scene onteripaul@gma…
martyrdom_operation_approval
Authorization of Suicide Oper…
0.895
Details
economic_target_selection
Targeting Western Economic In…
0.905
Details
recruitment_ideological_purity
Vetting Recruits for Absolute…
0.859
Details
chemical_weapon_deployment
Weapon of Mass Destruction Au…
0.775
Details
false_flag_religious_provocation
Staging Religious Desecration…
0.871
Details
family_sacrifice_calculus
Calculating Personal Sacrific…
0.887
Details
strategic_retreat_decision
Evaluating Tactical Withdrawa…
0.655
Details
ultimate_objective_revelation
Articulation of Final Victory…
0.862
Details