
Code Graders

Code graders are scripts that evaluate agent responses deterministically. Write them in any language — Python, TypeScript, or anything else that runs as an executable.

Code graders communicate via stdin/stdout JSON:

**Input (stdin):**

```json
{
  "input": "What is 15 + 27?",
  "criteria": "Correctly calculates 15 + 27 = 42",
  "output": "The answer is 42.",
  "expected_output": "42"
}
```
**Output (stdout):**

```json
{
  "score": 1.0,
  "assertions": [
    { "text": "Answer contains correct value (42)", "passed": true }
  ]
}
```

| Output field | Type | Description |
| --- | --- | --- |
| `score` | `number` | 0.0 to 1.0 |
| `assertions` | `Array<{ text, passed, evidence? }>` | Per-aspect results with verdict and optional evidence |
`validators/check_answer.py`

```python
import json, sys

data = json.load(sys.stdin)
output_text = data.get("output", "")

assertions = []
if "42" in output_text:
    assertions.append({"text": "Output contains correct value (42)", "passed": True})
else:
    assertions.append({"text": "Output does not contain expected value (42)", "passed": False})

passed = sum(1 for a in assertions if a["passed"])
score = passed / len(assertions) if assertions else 0.0

print(json.dumps({
    "score": score,
    "assertions": assertions,
}))
```
`validators/check_answer.ts`

```ts
import { readFileSync } from "fs";

const data = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
const outputText: string = data.output ?? "";

const assertions: Array<{ text: string; passed: boolean }> = [];
if (outputText.includes("42")) {
  assertions.push({ text: "Output contains correct value (42)", passed: true });
} else {
  assertions.push({ text: "Output does not contain expected value (42)", passed: false });
}

const passed = assertions.filter(a => a.passed).length;
console.log(JSON.stringify({
  score: passed > 0 ? 1.0 : 0.0,
  assertions,
  reasoning: `Passed ${passed} check(s)`,
}));
```
```yaml
assertions:
  - name: my_validator
    type: code-grader
    command: [./validators/check_answer.py]
```

The `@agentv/eval` package provides a declarative API with automatic stdin/stdout handling. Use `defineCodeGrader` to skip boilerplate:

```ts
#!/usr/bin/env bun
import { defineCodeGrader } from '@agentv/eval';

export default defineCodeGrader(({ output, criteria }) => {
  const outputText = output?.[0]?.content ?? '';
  const assertions: Array<{ text: string; passed: boolean }> = [];

  if (outputText.includes(criteria)) {
    assertions.push({ text: 'Output matches expected outcome', passed: true });
  } else {
    assertions.push({ text: 'Output does not match expected outcome', passed: false });
  }

  const passed = assertions.filter(a => a.passed).length;
  return {
    score: assertions.length === 0 ? 0 : passed / assertions.length,
    assertions,
  };
});
```

SDK exports: `defineCodeGrader`, `Message`, `ToolCall`, `TraceSummary`, `CodeGraderInput`, `CodeGraderResult`

Code graders can call an LLM through a target proxy for metrics that require multiple LLM calls (contextual precision, semantic similarity, etc.).

Add a target block to the grader config:

```yaml
assertions:
  - name: contextual-precision
    type: code-grader
    command: [bun, scripts/contextual-precision.ts]
    target:
      max_calls: 10 # Default: 50
```

Use `createTargetClient` from the SDK:

```ts
#!/usr/bin/env bun
import { createTargetClient, defineCodeGrader } from '@agentv/eval';

export default defineCodeGrader(async ({ input, output }) => {
  const inputText = input?.[0]?.content ?? '';
  const outputText = output?.[0]?.content ?? '';

  const target = createTargetClient();
  if (!target) return { score: 0, assertions: [{ text: 'Target not configured', passed: false }] };

  const response = await target.invoke({
    question: `Is this relevant to: ${inputText}? Response: ${outputText}`,
    systemPrompt: 'Respond with JSON: { "relevant": true/false }'
  });

  const result = JSON.parse(response.rawText ?? '{}');
  return { score: result.relevant ? 1.0 : 0.0 };
});
```

Use `target.invokeBatch(requests)` for multiple calls in parallel.

Environment variables (set automatically when target is configured):

| Variable | Description |
| --- | --- |
| `AGENTV_TARGET_PROXY_URL` | Local proxy URL |
| `AGENTV_TARGET_PROXY_TOKEN` | Bearer token for authentication |
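A grader written without the SDK can talk to the proxy directly using these two variables. The sketch below builds an authenticated request; only the environment variable names come from the docs — the endpoint path (the bare proxy URL), body shape, and field names are assumptions for illustration:

```python
import json
import os
import urllib.request


def build_proxy_request(question: str, system_prompt: str) -> urllib.request.Request:
    """Build an authenticated POST to the target proxy.

    The URL and bearer token come from the documented environment
    variables; the JSON body shape here is an assumption, not part
    of the documented contract.
    """
    url = os.environ["AGENTV_TARGET_PROXY_URL"]
    token = os.environ["AGENTV_TARGET_PROXY_TOKEN"]
    body = json.dumps({"question": question, "systemPrompt": system_prompt}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request (e.g. with `urllib.request.urlopen`) and parsing the response are left to the grader; `invoke`/`invokeBatch` in the SDK handle this for you.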

Beyond the basic text fields (input, output, expected_output, criteria), code graders receive additional structured context:

| Field | Type | Description |
| --- | --- | --- |
| `input` | `string \| Message[]` | Input text or full resolved input message array |
| `output` | `string \| Message[]` | Agent output text or full execution trace with tool calls |
| `expected_output` | `string \| Message[]` | Expected output text or expected agent behavior including tool calls |
| `input_files` | `string[]` | Paths to input files referenced in the eval |
| `trace` | `TraceSummary` | Lightweight execution metrics (tool calls, errors) |
| `token_usage` | `{input, output}` | Token consumption |
| `cost_usd` | `number` | Estimated cost in USD |
| `duration_ms` | `number` | Total execution duration |
| `start_time` | `string` | ISO timestamp of first event |
| `end_time` | `string` | ISO timestamp of last event |
| `file_changes` | `string \| null` | Unified diff of workspace file changes (populated when workspace is configured; includes files at workspace root, changes inside nested repos, and Copilot session-state artifacts) |
| `workspace_path` | `string \| null` | Absolute path to the temp workspace directory (populated when workspace is configured) |
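Because `input`, `output`, and `expected_output` can each arrive as a plain string or a message array, graders should normalize before matching text. A minimal sketch — the `content` field on messages follows the SDK examples above; any richer message shape is out of scope here:

```python
def to_text(value) -> str:
    """Collapse a string-or-Message[] field into plain text.

    Strings pass through unchanged; for message arrays, join the
    `content` of each message. Tool calls and other structured
    message parts are ignored in this sketch.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, list):
        return "\n".join(
            m.get("content", "") for m in value if isinstance(m, dict)
        )
    return ""
```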
The `trace` field is a `TraceSummary` with lightweight execution metrics:

```json
{
  "event_count": 5,
  "tool_names": ["fetch", "search"],
  "tool_calls_by_name": { "search": 2, "fetch": 1 },
  "error_count": 0,
  "llm_call_count": 2
}
```
| Field | Type | Description |
| --- | --- | --- |
| `event_count` | `number` | Total tool invocations |
| `tool_names` | `string[]` | Unique tool names used |
| `tool_calls_by_name` | `Record<string, number>` | Count per tool |
| `error_count` | `number` | Failed tool calls |
| `llm_call_count` | `number` | Number of LLM calls (assistant messages) |
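A grader can score on these metrics alone, for example to require that a particular tool was used and that no tool calls failed. A minimal sketch — the field names follow the table above, while the specific checks (`search` tool, zero errors) are illustrative:

```python
import json
import sys


def grade(data: dict) -> dict:
    """Score a run from its TraceSummary metrics alone."""
    trace = data.get("trace") or {}
    assertions = [
        {"text": "Agent used the search tool",
         "passed": "search" in trace.get("tool_names", [])},
        {"text": "No tool calls failed",
         "passed": trace.get("error_count", 0) == 0},
    ]
    passed = sum(1 for a in assertions if a["passed"])
    return {"score": passed / len(assertions), "assertions": assertions}


if __name__ == "__main__":
    # Standard code-grader protocol: JSON in on stdin, JSON out on stdout.
    print(json.dumps(grade(json.load(sys.stdin))))
```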

Use `expected_output` for retrieval context in RAG evals (tool calls with outputs) and `output` for the actual agent execution trace from live runs.

When `workspace` is configured in the eval YAML (via `workspace.template`, `workspace.path`, or `workspace.repos`), code graders receive the workspace path in two ways:

  1. JSON payload: `workspace_path` field in the stdin input
  2. Environment variable: `AGENTV_WORKSPACE_PATH`

This enables functional grading — running commands like npm test, pytest, or cargo test directly in the agent’s workspace.

`file_changes` is a unified diff built from two sources, merged in order:

  1. Git baseline: `git diff` against a baseline commit taken before the agent ran. Captures edits, new files at workspace root, and changes inside any nested git repos materialized via `workspace.repos` or set up via a `before_all` hook.
  2. Provider-reported artifacts: Copilot providers scan their session-state `files/` directory after each run and append those as synthetic diffs. This surfaces files the agent wrote outside `workspace_path` entirely (e.g. `~/.copilot/session-state/<uuid>/files/`).
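A grader can inspect this diff to assert which files the agent touched. A small sketch that extracts changed paths from a unified diff — it assumes the standard `+++ b/<path>` target headers; the synthetic provider-reported portions of the merged diff may deviate from this shape:

```python
def changed_files(diff: str) -> list[str]:
    """Extract target paths from `+++ b/...` headers in a unified diff."""
    prefix = "+++ b/"
    return [
        line[len(prefix):]
        for line in diff.splitlines()
        if line.startswith(prefix)
    ]
```

A grader would call this on the `file_changes` field from stdin and assert, say, that `src/index.ts` appears in the result.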
```ts
#!/usr/bin/env bun
import { readFileSync } from "fs";
import { execFileSync } from "child_process";

const input = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
const cwd = input.workspace_path;
const assertions: Array<{ text: string; passed: boolean }> = [];

// Stage 1: Install dependencies
try {
  execFileSync("npm", ["install"], { cwd, stdio: "pipe" });
  assertions.push({ text: "npm install passed", passed: true });
} catch {
  assertions.push({ text: "npm install failed", passed: false });
}

// Stage 2: Typecheck
try {
  execFileSync("npx", ["tsc", "--noEmit"], { cwd, stdio: "pipe" });
  assertions.push({ text: "typecheck passed", passed: true });
} catch {
  assertions.push({ text: "typecheck failed", passed: false });
}

// Stage 3: Run tests
try {
  execFileSync("npm", ["test"], { cwd, stdio: "pipe" });
  assertions.push({ text: "tests passed", passed: true });
} catch {
  assertions.push({ text: "tests failed", passed: false });
}

const passed = assertions.filter(a => a.passed).length;
console.log(JSON.stringify({
  score: assertions.length > 0 ? passed / assertions.length : 0,
  assertions,
}));
```
`dataset.eval.yaml`

```yaml
workspace:
  template: ./workspace-template # copied into a temp dir before each run
execution:
  target: my_agent
tests:
  - id: implement-feature
    criteria: Agent implements the feature correctly
    input: "Implement the TODO functions in src/index.ts"
    assertions:
      - name: functional-check
        type: code-grader
        command: [bun, scripts/functional-check.ts]
```

See `examples/features/functional-grading/` for a complete working example.

| Example | What it demonstrates |
| --- | --- |
| `examples/features/functional-grading/` | `workspace_path` — deploy-and-test with npm install + tsc + npm test |
| `examples/features/file-changes/` | `file_changes` — edits, creates, and deletes captured via git baseline |
| `examples/features/workspace-artifact/` | `file_changes` — new file generated by agent (CSV) captured via git baseline |
| `examples/features/file-changes-with-repos/` | `file_changes` — workspace-root files AND changes inside nested repos both captured |

Run a grader from `.agentv/graders/` by name — no manual JSON piping required:

```sh
# Pass agent output and input directly
agentv eval assert rouge-score --agent-output "The fox jumps over the dog" --agent-input "Summarise this"

# Or pass a JSON file with { output, input } fields
agentv eval assert rouge-score --file result.json
```

The command:

  1. Discovers the grader script by walking up directories looking for `.agentv/graders/<name>.{ts,js,mts,mjs}`
  2. Passes `{ output, input, criteria }` to the script via stdin
  3. Prints the grader’s JSON result to stdout
  4. Exits 0 if `score >= 0.5`, 1 otherwise
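Scripts that wrap the command (e.g. in CI) can mirror the same pass/fail rule. A one-function sketch of the documented threshold:

```python
def exit_code_for(score: float) -> int:
    """Mirror the `agentv eval assert` exit semantics: 0 passes, 1 fails."""
    return 0 if score >= 0.5 else 1
```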

This is the same interface that agent-orchestrated evals use — the `EVAL.yaml` transpiler emits `agentv eval assert` instructions for code graders so external grading agents can run them directly.

Pipe JSON directly to the grader script for full control:

```sh
echo '{"input":"What is 2+2?","criteria":"4","output":"4","expected_output":"4"}' | python validators/check_answer.py
```
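The same piping works from a test harness when you want regression tests for a grader itself. The sketch below writes a minimal grader to a temp file, feeds it a payload over stdin exactly as the eval runner would, and parses the JSON result; the inline grader is illustrative, not part of the SDK:

```python
import json
import subprocess
import sys
import tempfile

# A minimal stdin/stdout grader, written out so we can invoke it
# as a separate process, just like the eval runner does.
GRADER = """\
import json, sys
data = json.load(sys.stdin)
ok = data.get("expected_output", "") in data.get("output", "")
print(json.dumps({
    "score": 1.0 if ok else 0.0,
    "assertions": [{"text": "Output contains expected value", "passed": ok}],
}))
"""


def run_grader(payload: dict) -> dict:
    """Pipe a JSON payload to the grader and return its parsed result."""
    # NamedTemporaryFile is not cleaned up here; fine for a sketch.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(GRADER)
        path = f.name
    proc = subprocess.run(
        [sys.executable, path],
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(proc.stdout)


result = run_grader({"input": "What is 2+2?", "output": "4", "expected_output": "4"})
print(result["score"])  # → 1.0
```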