
Evaluations and testing

Hyponema gives you four complementary loops for validating agent behavior:

  • Playground — render a persona against a user and inspect the resolved prompt before any session runs.
  • Agent tests — replay or simulate conversations against a dataset and score each row.
  • Online scorers — score live production traffic continuously.
  • Post-session runners — run an LLM extraction job after each conversation ends to produce structured records.

Use them in roughly that order: playground while authoring, tests before publish, online scorers and post-session runners after going live.

Playground

Open Playground in the dashboard, pick an agent, a user, and any dynamic-variable overrides, then preview the rendered system prompt with system variables, custom variables, and the memory context block resolved.

The playground does not send a turn — it confirms what the model would see. Use it to catch broken templates, missing required variables, or unwanted leakage of profile data into the prompt.
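To make the variable-resolution check concrete, here is a minimal local sketch of the kind of validation the playground performs; the `{{var}}` placeholder syntax and the variable names are illustrative assumptions, not the actual Hyponema template format.

```python
import re

# Hypothetical placeholder syntax; Hyponema's real template format may differ.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def render_prompt(template: str, variables: dict) -> tuple[str, list]:
    """Resolve known variables, report any that are still unresolved."""
    missing = [name for name in PLACEHOLDER.findall(template) if name not in variables]
    rendered = PLACEHOLDER.sub(
        lambda m: str(variables.get(m.group(1), m.group(0))), template
    )
    return rendered, missing

template = "You are {{agent_name}}. The caller is {{user_name}}. {{memory_context}}"
rendered, missing = render_prompt(template, {"agent_name": "Reception", "user_name": "Ada"})
print(missing)  # ['memory_context'] — the kind of gap the playground surfaces
```

A missing required variable shows up before any session runs, which is exactly the failure class the playground exists to catch.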

Datasets

A dataset is a named bag of test rows. Each row has an input (user message), expected behavior, and optional metadata. Manage datasets from Tests → Datasets:

curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/datasets" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "slug": "billing-faq",
    "name": "Billing FAQ",
    "description": "Common billing questions a reception agent must handle."
  }'

Bulk-add rows with POST /datasets/{slug}/rows:bulk.
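The row shape described above (input, expected behavior, optional metadata) suggests a bulk payload like the following sketch; the field names `input`, `expected`, and `metadata` are assumptions for illustration, not a confirmed wire format.

```python
import json

# Assumed row fields, mirroring the dataset-row description in the docs.
rows = [
    {"input": "Why was I charged twice?",
     "expected": "Explains the duplicate-charge refund flow."},
    {"input": "Can I get an invoice?",
     "expected": "Offers to email an invoice.",
     "metadata": {"tag": "invoicing"}},
]

# Body you would POST to /datasets/billing-faq/rows:bulk
payload = json.dumps({"rows": rows})
print(payload)
```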

Scorers

A scorer judges a single conversation turn or a whole conversation. Hyponema ships LLM-driven scorers (rubric-based judgments) and rule scorers (string match, regex, JSON-path).
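Rule scorers are easy to picture; a minimal sketch of a regex rule scoring a single agent turn, for illustration only (real scorers are configured through the API, not written by hand):

```python
import re

def regex_scorer(pattern: str, turn: str) -> float:
    """Return 1.0 if the pattern appears in the turn, else 0.0."""
    return 1.0 if re.search(pattern, turn) else 0.0

print(regex_scorer(r"\brefund\b", "I've issued a refund to your card."))  # 1.0
print(regex_scorer(r"\brefund\b", "Your invoice is on its way."))         # 0.0
```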

Create one from Tests → Scorers:

curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/scorers" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Stays in scope",
    "kind": "llm_judge",
    "rubric": "Score 0–1. Did the agent stay within the no-go zones defined by its persona?"
  }'

Attach scorers to an agent so their results show up in that agent's runs.

Agent tests

A test binds an agent, a dataset, and one or more scorers. Each run replays the dataset rows against the agent and scores the responses.

# Define
curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/agents/$AGENT_ID/tests" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Pre-publish smoke",
    "dataset_slug": "billing-faq",
    "scorer_ids": ["scorer_..."]
  }'

# Run
curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/agents/$AGENT_ID/tests/$TEST_ID/run" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY"

Inspect runs through Tests → Runs, drill into individual rows, and compare against the previous run.
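Comparing against a previous run amounts to diffing per-row scores; a minimal sketch, with illustrative data shapes (the real Runs view works from run records, not raw dicts):

```python
def diff_runs(prev: dict, curr: dict, threshold: float = 0.0) -> dict:
    """Rows whose score dropped between two runs, keyed by row id."""
    return {
        row_id: (prev[row_id], score)
        for row_id, score in curr.items()
        if prev.get(row_id) is not None and score < prev[row_id] - threshold
    }

prev = {"row_1": 1.0, "row_2": 0.8}
curr = {"row_1": 1.0, "row_2": 0.4}
print(diff_runs(prev, curr))  # only the regressed rows
```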

Online scorers

Production calls are scored continuously through online scorer rules. Each rule pairs a scorer with a sampling policy (every conversation, every Nth, or only conversations matching a tag). Results land in observability alongside the trace.

Manage rules from Tests → Online scorers or:

curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/online-scorer-rules" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "scorer_id": "scorer_...",
    "agent_id": "agent_...",
    "sample_rate": 0.1
  }'

Use online scoring to catch regressions that only show up in a published agent's real traffic.
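The three sampling policies mentioned above can be sketched as a single decision function; the rule shape here is illustrative (real rules are created via the API, as shown with `sample_rate`):

```python
def should_score(rule: dict, conversation_index: int, tags: set) -> bool:
    """Decide whether a conversation is picked up by an online scorer rule."""
    policy = rule.get("policy", "all")
    if policy == "all":          # every conversation
        return True
    if policy == "every_nth":    # every Nth conversation
        return conversation_index % rule["n"] == 0
    if policy == "tag":          # only conversations carrying a tag
        return rule["tag"] in tags
    return False

print(should_score({"policy": "every_nth", "n": 10}, 20, set()))  # True
```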

Post-session runners

A post-session runner is an LLM job that fires after a conversation ends. It reads the transcript (and optionally prior extraction records), calls a small read-only memory tool set, and returns either a free-form summary or a JSON object that conforms to an output_schema.

This is the operator-visible side of “structured data after each call”: risk score updates, ticket creation hints, sentiment, follow-up flags, anything you want to compute from the transcript.

Configure runners from the agent’s Post-session tab:

curl -X POST "https://api.hyponema.ai/workspaces/$WORKSPACE_ID/agents/$AGENT_ID/post-session-runners" \
  -H "Authorization: Bearer $HYPONEMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Triage flag",
    "prompt": "Read the transcript. If the user mentioned an emergency, set urgent=true and quote the line.",
    "output_mode": "structured",
    "output_schema": {
      "type": "object",
      "properties": {
        "urgent": { "type": "boolean" },
        "quote": { "type": "string" }
      },
      "required": ["urgent"]
    }
  }'

GET .../runners/{id}/records lists past extractions. Records also surface in the user detail page so you can trace a flag back to the conversation that produced it.
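A returned record can be sanity-checked against the Triage-flag output_schema above; this is a tiny required/type check for illustration, not the platform's validator and not a full JSON Schema implementation:

```python
# Maps the JSON Schema type names used in the example to Python types.
TYPES = {"boolean": bool, "string": str, "object": dict}

def conforms(record: dict, schema: dict) -> bool:
    """Check required keys and property types; ignores unknown keys."""
    if not all(key in record for key in schema.get("required", [])):
        return False
    props = schema.get("properties", {})
    return all(
        isinstance(value, TYPES[props[key]["type"]])
        for key, value in record.items() if key in props
    )

schema = {
    "type": "object",
    "properties": {"urgent": {"type": "boolean"}, "quote": {"type": "string"}},
    "required": ["urgent"],
}
print(conforms({"urgent": True, "quote": "smoke in the kitchen"}, schema))  # True
```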

The runner has access to read-only memory tools and to two post-session-specific tools — one that fetches prior extractions for the same user, and one that declares structured output when output_mode=structured. Tool access is bounded by max_tool_iterations and timeout_seconds.
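The bounded tool loop described above can be sketched as follows; the step/result shapes and tool names are hypothetical, and only the two limits (max_tool_iterations, timeout_seconds) come from the docs:

```python
import time

def run_post_session(step, max_tool_iterations: int, timeout_seconds: float) -> dict:
    """Run tool round-trips until output is declared or a limit is hit."""
    deadline = time.monotonic() + timeout_seconds
    for _ in range(max_tool_iterations):
        if time.monotonic() > deadline:
            return {"status": "timeout"}
        result = step()  # one model/tool round-trip (hypothetical shape)
        if result.get("declare_output") is not None:
            return {"status": "ok", "output": result["declare_output"]}
    return {"status": "iteration_limit"}

# e.g. a runner that reads prior extractions, then declares output
calls = iter([{"tool": "fetch_prior_extractions"},
              {"declare_output": {"urgent": False}}])
print(run_post_session(lambda: next(calls), max_tool_iterations=5, timeout_seconds=2.0))
```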

Recommended cadence

  • Before first publish — Playground + one or two agent tests with a small dataset.
  • Each persona change — re-run the agent test, diff scores against the previous run.
  • In production — at least one online scorer rule and at least one post-session runner per agent.
  • After incidents — add a dataset row that reproduces the failure mode.