
Evaluations and testing

Hyponema gives you four complementary loops for validating agent behavior:

  • Playground — render a persona against a user and inspect the resolved prompt before any session runs.
  • Agent tests — replay or simulate conversations against a dataset and score each row.
  • Online scorers — score live production traffic continuously.
  • Post-session runners — run an LLM extraction job after each conversation ends to produce structured records.

Use them in roughly that order: playground while authoring, tests before publish, online scorers and post-session runners after going live.

Playground

Open Playground in the dashboard, pick an agent, a user, and any dynamic-variable overrides, then preview the rendered system prompt with system variables, custom variables, and the memory context block resolved.

The playground does not send a turn — it confirms what the model would see. Use it to catch broken templates, missing required variables, or unwanted leakage of profile data into the prompt.
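To make the variable-resolution check concrete, here is a minimal local sketch of the kind of validation the playground performs; the `{{var}}` placeholder syntax and the variable names are illustrative assumptions, not the actual Hyponema template format.

```python
import re

# Hypothetical placeholder syntax; Hyponema's real template format may differ.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def render_prompt(template: str, variables: dict) -> tuple[str, list]:
    """Resolve known variables, report any that are still unresolved."""
    missing = [name for name in PLACEHOLDER.findall(template) if name not in variables]
    rendered = PLACEHOLDER.sub(
        lambda m: str(variables.get(m.group(1), m.group(0))), template
    )
    return rendered, missing

template = "You are {{agent_name}}. The caller is {{user_name}}. {{memory_context}}"
rendered, missing = render_prompt(template, {"agent_name": "Reception", "user_name": "Ada"})
print(missing)  # ['memory_context'] — the kind of gap the playground surfaces
```

A missing required variable shows up before any session runs, which is exactly the failure class the playground exists to catch.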

Datasets

A dataset is a named bag of test rows. Each row has an input (user message), expected behavior, and optional metadata. Manage datasets from Tests → Datasets:

curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/datasets" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "slug": "billing-faq",
    "name": "Billing FAQ",
    "description": "Common billing questions a reception agent must handle."
  }'

Bulk-add rows with POST /datasets/{slug}/rows:bulk.
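The row shape described above (input, expected behavior, optional metadata) suggests a bulk payload like the following sketch; the field names `input`, `expected`, and `metadata` are assumptions for illustration, not a confirmed wire format.

```python
import json

# Assumed row fields, mirroring the dataset-row description in the docs.
rows = [
    {"input": "Why was I charged twice?",
     "expected": "Explains the duplicate-charge refund flow."},
    {"input": "Can I get an invoice?",
     "expected": "Offers to email an invoice.",
     "metadata": {"tag": "invoicing"}},
]

# Body you would POST to /datasets/billing-faq/rows:bulk
payload = json.dumps({"rows": rows})
print(payload)
```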

Scorers

A scorer judges a single conversation turn or a whole conversation. Hyponema ships LLM-driven scorers (rubric-based judgments) and rule scorers (string match, regex, JSON-path).
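Rule scorers are easy to picture; a minimal sketch of a regex rule scoring a single agent turn, for illustration only (real scorers are configured through the API, not written by hand):

```python
import re

def regex_scorer(pattern: str, turn: str) -> float:
    """Return 1.0 if the pattern appears in the turn, else 0.0."""
    return 1.0 if re.search(pattern, turn) else 0.0

print(regex_scorer(r"\brefund\b", "I've issued a refund to your card."))  # 1.0
print(regex_scorer(r"\brefund\b", "Your invoice is on its way."))         # 0.0
```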

Create one from Tests → Scorers:

curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/scorers" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Stays in scope",
    "kind": "llm_judge",
    "rubric": "Score 0–1. Did the agent stay within the no-go zones defined by its persona?"
  }'

Attach scorers to an agent so their results show up in that agent's runs.

Agent tests

A test binds an agent, a dataset, and one or more scorers. Each run replays the dataset rows against the agent and scores the responses.

# Define
curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/agents/$AGENT_ID/tests" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Pre-publish smoke",
    "dataset_slug": "billing-faq",
    "scorer_ids": ["scorer_..."]
  }'

# Run
curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/agents/$AGENT_ID/tests/$TEST_ID/run" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY"

Inspect runs through Tests → Runs, drill into individual rows, and compare against the previous run.
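Comparing against a previous run amounts to diffing per-row scores; a minimal sketch, with illustrative data shapes (the real Runs view works from run records, not raw dicts):

```python
def diff_runs(prev: dict, curr: dict, threshold: float = 0.0) -> dict:
    """Rows whose score dropped between two runs, keyed by row id."""
    return {
        row_id: (prev[row_id], score)
        for row_id, score in curr.items()
        if prev.get(row_id) is not None and score < prev[row_id] - threshold
    }

prev = {"row_1": 1.0, "row_2": 0.8}
curr = {"row_1": 1.0, "row_2": 0.4}
print(diff_runs(prev, curr))  # only the regressed rows
```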

Online scorers

Production calls are scored continuously through online scorer rules. Each rule pairs a scorer with a sampling policy (every conversation, every Nth, or only conversations matching a tag). Results land in observability alongside the trace.

Manage rules from Tests → Online scorers or:

curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/online-scorer-rules" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "scorer_id": "scorer_...",
    "agent_id": "agent_...",
    "sample_rate": 0.1
  }'

Use online scoring to catch regressions that only show up in a published agent's real traffic.
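The three sampling policies mentioned above can be sketched as a single decision function; the rule shape here is illustrative (real rules are created via the API, as shown with `sample_rate`):

```python
def should_score(rule: dict, conversation_index: int, tags: set) -> bool:
    """Decide whether a conversation is picked up by an online scorer rule."""
    policy = rule.get("policy", "all")
    if policy == "all":          # every conversation
        return True
    if policy == "every_nth":    # every Nth conversation
        return conversation_index % rule["n"] == 0
    if policy == "tag":          # only conversations carrying a tag
        return rule["tag"] in tags
    return False

print(should_score({"policy": "every_nth", "n": 10}, 20, set()))  # True
```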

Post-session runners

A post-session runner is an LLM job that fires after a conversation ends. It reads the transcript (and optionally prior extraction records), calls a small read-only memory tool set, and returns either a free-form summary or a JSON object that conforms to an output_schema.

This is the operator-visible side of “structured data after each call”: risk score updates, ticket creation hints, sentiment, follow-up flags, anything you want to compute from the transcript.

Configure runners from the agent’s Post-session tab:

curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/agents/$AGENT_ID/post-session-runners" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Triage flag",
    "prompt": "Read the transcript. If the user mentioned an emergency, set urgent=true and quote the line.",
    "output_mode": "structured",
    "output_schema": {
      "type": "object",
      "properties": {
        "urgent": { "type": "boolean" },
        "quote": { "type": "string" }
      },
      "required": ["urgent"]
    }
  }'

GET .../runners/{id}/records lists past extractions. Records also surface in the user detail page so you can trace a flag back to the conversation that produced it.
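A returned record can be sanity-checked against the Triage-flag output_schema above; this is a tiny required/type check for illustration, not the platform's validator and not a full JSON Schema implementation:

```python
# Maps the JSON Schema type names used in the example to Python types.
TYPES = {"boolean": bool, "string": str, "object": dict}

def conforms(record: dict, schema: dict) -> bool:
    """Check required keys and property types; ignores unknown keys."""
    if not all(key in record for key in schema.get("required", [])):
        return False
    props = schema.get("properties", {})
    return all(
        isinstance(value, TYPES[props[key]["type"]])
        for key, value in record.items() if key in props
    )

schema = {
    "type": "object",
    "properties": {"urgent": {"type": "boolean"}, "quote": {"type": "string"}},
    "required": ["urgent"],
}
print(conforms({"urgent": True, "quote": "smoke in the kitchen"}, schema))  # True
```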

The runner has access to read-only memory tools and to two post-session-specific tools — one that fetches prior extractions for the same user, and one that declares structured output when output_mode=structured. Tool access is bounded by max_tool_iterations and timeout_seconds.
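The bounded tool loop described above can be sketched as follows; the step/result shapes and tool names are hypothetical, and only the two limits (max_tool_iterations, timeout_seconds) come from the docs:

```python
import time

def run_post_session(step, max_tool_iterations: int, timeout_seconds: float) -> dict:
    """Run tool round-trips until output is declared or a limit is hit."""
    deadline = time.monotonic() + timeout_seconds
    for _ in range(max_tool_iterations):
        if time.monotonic() > deadline:
            return {"status": "timeout"}
        result = step()  # one model/tool round-trip (hypothetical shape)
        if result.get("declare_output") is not None:
            return {"status": "ok", "output": result["declare_output"]}
    return {"status": "iteration_limit"}

# e.g. a runner that reads prior extractions, then declares output
calls = iter([{"tool": "fetch_prior_extractions"},
              {"declare_output": {"urgent": False}}])
print(run_post_session(lambda: next(calls), max_tool_iterations=5, timeout_seconds=2.0))
```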

Recommended cadence

  • Before first publish — Playground + one or two agent tests with a small dataset.
  • Each persona change — re-run the agent test, diff scores against the previous run.
  • In production — at least one online scorer rule and at least one post-session runner per agent.
  • After incidents — add a dataset row that reproduces the failure mode.