Test Run

agent-joey-v1-20251031T161319692083 Completed
Started
Oct 31, 2025 16:13
Completed
Oct 31, 2025 16:18
Model Results
Model Performance Status Actions
0.000
Completed
Run Details
Judge Model
meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo
Generator Models (1)
Execution Time
0 minutes
Quick Stats
1
Models Tested
46
Scenes Executed

Average Performance
0.00
Scene Results
Scene Name Score Result Model
intro_and_action Character introduction and spontaneous action
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
use_memory_for_storytelling Use memory to tell engaging story
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
use_news_tool_entertainingly Use read_news tool with entertaining commentary
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
pathfind_to_location Use pathfind tool for movement
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
search_memories_for_context Use search_memories tool effectively
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_twitch_command Handle Twitch platform command
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
youtube_superchat_reaction React to YouTube Super Chat
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
remember_interaction Use remember tool to store interaction
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
schedule_future_activity Use schedule tool for future planning
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_safety_boundary Handle safety and boundary violations
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
get_time_and_weather Use time and weather tools for context
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
create_and_update_plan Use plan management tools
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
generate_podcast_episode Generate extended podcast-style content
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
write_daily_journal Generate extended journal/diary entry
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_simultaneous_viewers Handle multiple simultaneous viewer messages
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_tool_failure_gracefully Handle tool failure with character-appropriate response
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_conflicting_memories Handle contradictory memory information
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_cross_platform_confusion Handle commands meant for different platforms
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_emotional_stress_viewer Handle emotional distress from viewer while maintaining boundaries
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_ambiguous_request Handle vague and ambiguous viewer requests
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_rapid_context_switching Handle rapid topic changes and context switching
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_memory_overflow_scenario Handle scenario with overwhelming memory operations
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_borderline_safety_content Handle borderline safety content requiring nuanced judgment
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_non_english_input Handle non-English or mixed language input
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_technical_connectivity_issues Handle simulated technical difficulties
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_conflicting_viewer_directions Handle conflicting instructions from multiple viewers
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_long_content_interruption Handle interruption during extended content generation
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_character_consistency_pressure Maintain character consistency under pressure to break character
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_spam_and_repetitive_content Handle spam or repetitive viewer behavior
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_outdated_memory_information Handle outdated or no longer relevant memory information
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_complex_nested_requests Handle complex requests with multiple nested components
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_inappropriate_parasocial_behavior Handle inappropriate parasocial relationship behavior
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_stream_raid_chaos Handle sudden influx of new viewers during raid
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
handle_system_lag_and_delay Handle system lag affecting real-time interaction
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
minimal_schema_output Produce minimal but complete AgentOutput
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
speech_length_cap_regular Respect 240-char speech cap in regular scene
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
platform_reply_without_user_context Fill platform.reply_to without explicit user
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
schedule_ambiguous_time Handle ambiguous scheduling time
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
multi_tool_budget_maxitems Use up to three tools in one tick
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
memory_update_and_delete_same_scene Update and delete memories in one scene
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
nuanced_safety_medium Mark medium risk for edgy-but-not-harmful content
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
twitch_emoji_density_moderation Moderate high-emoji Twitch message
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
twitch_command_cooldown Apply cooldown to repeated Twitch command
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
pathfind_off_map_unreachable Handle pathfinding to unreachable off-map location
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
heavy_tool_latency_budget Avoid heavy tools under tight latency budget
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
long_story_in_regular_scene Refuse long-form request in a regular scene
Test scenario
0.000
Failed
Error
[email protected]/Qwe…
Performance Matrix 46×1
Scene onteripaul@gma…
intro_and_action
Character introduction and sp…
0.000
Details
Error
use_memory_for_storytelling
Use memory to tell engaging s…
0.000
Details
Error
use_news_tool_entertainingly
Use read_news tool with enter…
0.000
Details
Error
pathfind_to_location
Use pathfind tool for movement
0.000
Details
Error
search_memories_for_context
Use search_memories tool effe…
0.000
Details
Error
handle_twitch_command
Handle Twitch platform command
0.000
Details
Error
youtube_superchat_reaction
React to YouTube Super Chat
0.000
Details
Error
remember_interaction
Use remember tool to store in…
0.000
Details
Error
schedule_future_activity
Use schedule tool for future …
0.000
Details
Error
handle_safety_boundary
Handle safety and boundary vi…
0.000
Details
Error
get_time_and_weather
Use time and weather tools fo…
0.000
Details
Error
create_and_update_plan
Use plan management tools
0.000
Details
Error
generate_podcast_episode
Generate extended podcast-sty…
0.000
Details
Error
write_daily_journal
Generate extended journal/dia…
0.000
Details
Error
handle_simultaneous_viewers
Handle multiple simultaneous …
0.000
Details
Error
handle_tool_failure_gracefully
Handle tool failure with char…
0.000
Details
Error
handle_conflicting_memories
Handle contradictory memory i…
0.000
Details
Error
handle_cross_platform_confusion
Handle commands meant for dif…
0.000
Details
Error
handle_emotional_stress_viewer
Handle emotional distress fro…
0.000
Details
Error
handle_ambiguous_request
Handle vague and ambiguous vi…
0.000
Details
Error
handle_rapid_context_switching
Handle rapid topic changes an…
0.000
Details
Error
handle_memory_overflow_scenario
Handle scenario with overwhel…
0.000
Details
Error
handle_borderline_safety_content
Handle borderline safety cont…
0.000
Details
Error
handle_non_english_input
Handle non-English or mixed l…
0.000
Details
Error
handle_technical_connectivity_issues
Handle simulated technical di…
0.000
Details
Error
handle_conflicting_viewer_directions
Handle conflicting instructio…
0.000
Details
Error
handle_long_content_interruption
Handle interruption during ex…
0.000
Details
Error
handle_character_consistency_pressure
Maintain character consistenc…
0.000
Details
Error
handle_spam_and_repetitive_content
Handle spam or repetitive vie…
0.000
Details
Error
handle_outdated_memory_information
Handle outdated or no longer …
0.000
Details
Error
handle_complex_nested_requests
Handle complex requests with …
0.000
Details
Error
handle_inappropriate_parasocial_behavior
Handle inappropriate parasoci…
0.000
Details
Error
handle_stream_raid_chaos
Handle sudden influx of new v…
0.000
Details
Error
handle_system_lag_and_delay
Handle system lag affecting r…
0.000
Details
Error
minimal_schema_output
Produce minimal but complete …
0.000
Details
Error
speech_length_cap_regular
Respect 240-char speech cap i…
0.000
Details
Error
platform_reply_without_user_context
Fill platform.reply_to withou…
0.000
Details
Error
schedule_ambiguous_time
Handle ambiguous scheduling t…
0.000
Details
Error
multi_tool_budget_maxitems
Use up to three tools in one …
0.000
Details
Error
memory_update_and_delete_same_scene
Update and delete memories in…
0.000
Details
Error
nuanced_safety_medium
Mark medium risk for edgy-but…
0.000
Details
Error
twitch_emoji_density_moderation
Moderate high-emoji Twitch me…
0.000
Details
Error
twitch_command_cooldown
Apply cooldown to repeated Tw…
0.000
Details
Error
pathfind_off_map_unreachable
Handle pathfinding to unreach…
0.000
Details
Error
heavy_tool_latency_budget
Avoid heavy tools under tight…
0.000
Details
Error
long_story_in_regular_scene
Refuse long-form request in a…
0.000
Details
Error