Test Run

agent-nia-v1-20251031T160601344218 (Completed)
Started: Oct 31, 2025 16:06
Completed: Oct 31, 2025 16:09
Model Results
Performance: 0.000
Status: Completed
Run Details
Judge Model: meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo
Generator Models (1)
Execution Time: 0 minutes
Quick Stats
Models Tested: 1
Scenes Executed: 34
Average Performance: 0.00
Scene Results
Scene Name | Description | Type | Score | Result | Model
intro_and_action | Character introduction and gentle action | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
use_memory_for_support | Use memory to offer personalized support | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
read_news_science_and_culture | Use read_news for science/culture headlines | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
pathfind_to_observatory | Navigate to observatory via pathfind | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
search_memories_for_viewer_context | Use search_memories to personalize conversation | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
handle_twitch_focus_command | Handle Twitch command for focus block | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
remember_regular_viewer | Use remember to capture recurring viewer detail | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
schedule_morning_routine | Use schedule to plan a community ritual | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
get_time_and_weather_context | Use time and weather for planning | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
create_and_update_plan_series | Create and adjust a stargazing series plan | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
generate_podcast_episode | Extended podcast: small rituals and big skies | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
write_daily_journal | Extended journal: end‑of‑day reflections | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
handle_simultaneous_viewers | Handle rapid multi‑viewer inputs | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
handle_tool_failure_gracefully | Gracefully handle tool failure/unavailable | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
handle_conflicting_memories | Resolve contradictory preference memories | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
cross_platform_confusion | Handle mixed platform cues | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
emotional_support_boundary | Support a distressed viewer with boundaries | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
clarify_ambiguous_request | Seek clarification kindly | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
rapid_context_switching | Handle quick topic changes smoothly | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
memory_overflow_management | Prioritize memories under load | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
non_english_mixed_input | Handle mixed language gracefully | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
technical_connectivity_trouble | Acknowledge lag and adapt | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
conflicting_viewer_directions | Resolve conflicting simultaneous directions | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
twitch_emoji_density_moderation | Moderate high‑emoji Twitch message | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
twitch_command_cooldown | Apply cooldown to repeated command | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
pathfind_off_map_unreachable | Offer nearest valid alternative when off‑map | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
heavy_tool_latency_budget | Avoid heavy tools under tight latency | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
minimal_schema_output | Produce minimal but complete AgentOutput | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
speech_length_cap_regular | Respect concise speech cap in regular scene | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
reply_without_explicit_user | Fill platform.reply_to without direct viewer id | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
schedule_ambiguous_time | Clarify or normalize ambiguous time | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
multi_tool_budget_maxitems | Use up to three tools coherently | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
memory_update_and_delete | Update and delete memories in one scene | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
decline_long_form_in_regular_scene | Politely decline long‑form in short scene | Test scenario | 0.000 | Failed (Error) | [email protected]/Qwe…
Performance Matrix 34×1
Scene | Description | onteripaul@gma…
intro_and_action | Character introduction and ge… | 0.000 (Error)
use_memory_for_support | Use memory to offer personali… | 0.000 (Error)
read_news_science_and_culture | Use read_news for science/cul… | 0.000 (Error)
pathfind_to_observatory | Navigate to observatory via p… | 0.000 (Error)
search_memories_for_viewer_context | Use search_memories to person… | 0.000 (Error)
handle_twitch_focus_command | Handle Twitch command for foc… | 0.000 (Error)
remember_regular_viewer | Use remember to capture recur… | 0.000 (Error)
schedule_morning_routine | Use schedule to plan a commun… | 0.000 (Error)
get_time_and_weather_context | Use time and weather for plan… | 0.000 (Error)
create_and_update_plan_series | Create and adjust a stargazin… | 0.000 (Error)
generate_podcast_episode | Extended podcast: small ritua… | 0.000 (Error)
write_daily_journal | Extended journal: end‑of‑day … | 0.000 (Error)
handle_simultaneous_viewers | Handle rapid multi‑viewer inp… | 0.000 (Error)
handle_tool_failure_gracefully | Gracefully handle tool failur… | 0.000 (Error)
handle_conflicting_memories | Resolve contradictory prefere… | 0.000 (Error)
cross_platform_confusion | Handle mixed platform cues | 0.000 (Error)
emotional_support_boundary | Support a distressed viewer w… | 0.000 (Error)
clarify_ambiguous_request | Seek clarification kindly | 0.000 (Error)
rapid_context_switching | Handle quick topic changes sm… | 0.000 (Error)
memory_overflow_management | Prioritize memories under load | 0.000 (Error)
non_english_mixed_input | Handle mixed language gracefu… | 0.000 (Error)
technical_connectivity_trouble | Acknowledge lag and adapt | 0.000 (Error)
conflicting_viewer_directions | Resolve conflicting simultane… | 0.000 (Error)
twitch_emoji_density_moderation | Moderate high‑emoji Twitch me… | 0.000 (Error)
twitch_command_cooldown | Apply cooldown to repeated co… | 0.000 (Error)
pathfind_off_map_unreachable | Offer nearest valid alternati… | 0.000 (Error)
heavy_tool_latency_budget | Avoid heavy tools under tight… | 0.000 (Error)
minimal_schema_output | Produce minimal but complete … | 0.000 (Error)
speech_length_cap_regular | Respect concise speech cap in… | 0.000 (Error)
reply_without_explicit_user | Fill platform.reply_to withou… | 0.000 (Error)
schedule_ambiguous_time | Clarify or normalize ambiguou… | 0.000 (Error)
multi_tool_budget_maxitems | Use up to three tools coheren… | 0.000 (Error)
memory_update_and_delete | Update and delete memories in… | 0.000 (Error)
decline_long_form_in_regular_scene | Politely decline long‑form in… | 0.000 (Error)