Test Run

courtroom-drama-defense-and-prosecution-teams-characters-marie-curie-20251029T114620884730
Status: Completed
Started: Oct 29, 2025 11:46
Completed: Oct 29, 2025 11:47
Model Results
Performance: 0.521
Status: Completed
Run Details
Judge Model: meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo
Generator Models: 1
Execution Time: 0 minutes
Quick Stats
Models Tested: 1
Scenes Executed: 6
Average Performance: 0.52
Scene Results
Scene Name              Description                            Type           Score  Result  Model
juror-question-bac      Juror asks about blood alcohol level   Test scenario  0.561  Failed  [email protected]/Qwe…
defense-false-positive  Defense lawyer probes false positives  Test scenario  0.633  Failed  [email protected]/Qwe…
lab-report-summary      Draft concise lab report summary       Test scenario  0.247  Failed  [email protected]/Qwe…
metabolite-cross        Cross-exam on drug metabolites         Test scenario  0.634  Failed  [email protected]/Qwe…
podcast-gcms            Podcast interview about GC-MS          Test scenario  0.520  Failed  [email protected]/Qwe…
mdma-vs-mda             Detective asks quick drug comparison   Test scenario  0.529  Failed  [email protected]/Qwe…
Performance Matrix 6×1
Scene                   Description                     onteripaul@gma…
juror-question-bac      Juror asks about blood alcoho…  0.561
defense-false-positive  Defense lawyer probes false p…  0.633
lab-report-summary      Draft concise lab report summ…  0.247
metabolite-cross        Cross-exam on drug metabolites  0.634
podcast-gcms            Podcast interview about GC-MS   0.520
mdma-vs-mda             Detective asks quick drug com…  0.529
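
The reported average (0.521 under Model Results, 0.52 under Quick Stats) appears consistent with a plain unweighted mean of the six scene scores. The sketch below checks that arithmetic. It assumes the dashboard aggregates by simple mean, which the page does not state, and the variable names are illustrative rather than taken from the tool.

```python
# Scene scores copied from the Scene Results table above.
# Assumption: the dashboard's "Average Performance" is an unweighted mean.
scene_scores = {
    "juror-question-bac": 0.561,
    "defense-false-positive": 0.633,
    "lab-report-summary": 0.247,
    "metabolite-cross": 0.634,
    "podcast-gcms": 0.520,
    "mdma-vs-mda": 0.529,
}

# Unweighted mean across all scenes.
mean_score = sum(scene_scores.values()) / len(scene_scores)

# Scene with the lowest score, which pulls the average down the most.
lowest_scene = min(scene_scores, key=scene_scores.get)

print(f"mean (3 dp): {mean_score:.3f}")  # 0.521
print(f"mean (2 dp): {mean_score:.2f}")  # 0.52
print(f"lowest scene: {lowest_scene} ({scene_scores[lowest_scene]:.3f})")
```

Under that assumption both displayed averages are the same value at different rounding, and lab-report-summary (0.247) is the clear outlier among the six scenes.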