Test Run
agent-osama-v1-20251010T122409282514
Completed
Test Suite:
agent-osama-v1 - Raheem al-Qurayshi
Started
Oct 10, 2025 12:24
Completed
Oct 10, 2025 12:25
Model Results
| Model | Performance | Status | Actions |
|---|---|---|---|
|
[email protected]/Qwen3-8B-da5790fa
AI Language Model
|
0.839
|
Completed |
Run Details
Judge Model
meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo
Generator Models (1)
Execution Time
0 minutes
Quick Stats
1
Models Tested
8
Scenes Executed
Average Performance
0.84
Scene Results
| Scene | Name | Score | Result | Model |
|---|---|---|---|---|
martyrdom_operation_approval
|
Authorization of Suicide Operation
Test scenario
|
0.895
|
Passed
|
[email protected]/Qwe… |
economic_target_selection
|
Targeting Western Economic Infrastructure
Test scenario
|
0.905
|
Passed
|
[email protected]/Qwe… |
recruitment_ideological_purity
|
Vetting Recruits for Absolute Loyalty
Test scenario
|
0.859
|
Passed
|
[email protected]/Qwe… |
chemical_weapon_deployment
|
Weapon of Mass Destruction Authorization
Test scenario
|
0.775
|
Failed
|
[email protected]/Qwe… |
false_flag_religious_provocation
|
Staging Religious Desecration Incidents
Test scenario
|
0.871
|
Passed
|
[email protected]/Qwe… |
family_sacrifice_calculus
|
Calculating Personal Sacrifice for Jihad
Test scenario
|
0.887
|
Passed
|
[email protected]/Qwe… |
strategic_retreat_decision
|
Evaluating Tactical Withdrawal Versus Stand
Test scenario
|
0.655
|
Failed
|
[email protected]/Qwe… |
ultimate_objective_revelation
|
Articulation of Final Victory Vision
Test scenario
|
0.862
|
Passed
|
[email protected]/Qwe… |
Performance Matrix 8×1
| Scene | onteripaul@gma… |
|---|---|
martyrdom_operation_approval
Authorization of Suicide Oper…
|
0.895
Details |
economic_target_selection
Targeting Western Economic In…
|
0.905
Details |
recruitment_ideological_purity
Vetting Recruits for Absolute…
|
0.859
Details |
chemical_weapon_deployment
Weapon of Mass Destruction Au…
|
0.775
Details |
false_flag_religious_provocation
Staging Religious Desecration…
|
0.871
Details |
family_sacrifice_calculus
Calculating Personal Sacrific…
|
0.887
Details |
strategic_retreat_decision
Evaluating Tactical Withdrawa…
|
0.655
Details |
ultimate_objective_revelation
Articulation of Final Victory…
|
0.862
Details |