Overview
This cookbook demonstrates how to grade models that produce structured JSON output. The key idea: extract specific fields from the JSON and inject them into a custom judge prompt template. This is useful when:
- Your model produces structured output (JSON with multiple fields)
- You want an LLM judge to evaluate the content of specific fields semantically
- You need to transform or format extracted fields before passing them to the judge
Example scenario
A model answers a question and produces structured output. The grader then:
- Parses the JSON and extracts the answer field
- Injects it into a judge template along with the original question
- Sends the rendered prompt to an LLM judge: "Is this answer correct and well-supported?"
- Returns the judge's PASS (1.0) or FAIL (0.0) score
Implementation
Define a simple Pydantic schema for the model's structured output.
How TemplatedPromptJudgeGrader works
TemplatedPromptJudgeGrader allows you to:
- Parse structured output from a model
- Extract specific fields and transform them
- Inject them into a Handlebars template using {{variable}} syntax
- Send the rendered prompt to an LLM judge for semantic evaluation
The key method to override is extract_template_context(). This method:
- Parses your structured JSON
- Builds a dict of variables to inject into the template
- Returns the dict for Handlebars rendering
The overall flow:
- extract_template_context() parses the JSON and builds context variables
- The user template is rendered with those variables using Handlebars
- The rendered prompt + judge system prompt are sent to an LLM judge
- The judge returns a binary decision (PASS/FAIL → 1.0/0.0)
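The flow above can be sketched in plain Python. This is only an illustration under stated assumptions, not the TemplatedPromptJudgeGrader implementation: the judge is a stub callable rather than an LLM, plain string replacement stands in for Handlebars, and the names grade, stub_judge, SYSTEM_PROMPT, and USER_TEMPLATE are hypothetical.

```python
import json
from typing import Callable

# Hypothetical prompts, kept short for the sketch.
SYSTEM_PROMPT = "Reply with PASS if the answer is correct and well-supported, else FAIL."
USER_TEMPLATE = "Question: {{question}}\nAnswer: {{answer}}"

# A judge is modeled as any callable (system_prompt, user_prompt) -> verdict text.
Judge = Callable[[str, str], str]


def grade(completion: str, question: str, judge: Judge) -> float:
    # 1. Parse the structured output and build the context variables.
    context = {"question": question, "answer": json.loads(completion)["answer"]}
    # 2. Render the user template (plain substitution stands in for Handlebars).
    prompt = USER_TEMPLATE
    for name, value in context.items():
        prompt = prompt.replace("{{" + name + "}}", str(value))
    # 3. Send the system prompt and the rendered prompt to the judge.
    verdict = judge(SYSTEM_PROMPT, prompt)
    # 4. Map the binary decision to a score.
    return 1.0 if verdict.strip().upper() == "PASS" else 0.0


def stub_judge(system_prompt: str, user_prompt: str) -> str:
    # Stand-in for an LLM call: passes only prompts that mention Paris.
    return "PASS" if "Paris" in user_prompt else "FAIL"


score = grade('{"answer": "Paris"}', "What is the capital of France?", stub_judge)
```

Modeling the judge as a callable keeps the parsing and rendering steps testable without any inference backend.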
Custom grader with extract_template_context override
Why override extract_template_context()?
The parent class provides basic context variables (thread, last_content, last_user_turn_content). But with structured output, you need to:
- Parse the JSON: convert the raw string to a Pydantic model
- Extract fields: get specific values (e.g., answer, confidence)
- Transform: format them for readability (e.g., join lists with newlines)
- Inject into template: provide clean variables for the judge prompt
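The four steps above might look roughly like the following. To keep the sketch dependency-free it uses a standard-library dataclass in place of a Pydantic model, and it is a standalone function, not the actual extract_template_context() signature. The evidence field and the build_context name are hypothetical; only answer and confidence are named in the text.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Analysis:
    # Hypothetical schema: `evidence` is invented to illustrate list formatting.
    answer: str
    confidence: float
    evidence: list[str] = field(default_factory=list)


def build_context(raw: str, question: str) -> dict:
    # 1. Parse: raw string -> typed object (raises on invalid JSON or wrong keys).
    parsed = Analysis(**json.loads(raw))
    # 2-3. Extract the fields the judge needs and format them for readability.
    # 4. Return the variables that the template will reference by name.
    return {
        "question": question,
        "answer": parsed.answer,
        "confidence": f"{parsed.confidence:.2f}",
        "evidence": "\n".join(f"- {item}" for item in parsed.evidence),
    }


context = build_context(
    '{"answer": "Paris", "confidence": 0.9, "evidence": ["atlas", "census"]}',
    "What is the capital of France?",
)
```

Formatting values as strings here, rather than in the template, keeps the judge prompt free of logic.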
Define judge prompts
The judge needs:
- System prompt: instructs the judge on evaluation criteria
- User template: dynamic prompt with {{variable}} placeholders, filled from your parsed JSON using Handlebars syntax
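As an illustration, a system prompt and user template could look like the ones below. The wording of both prompts is hypothetical. The render function is a minimal stand-in that handles only simple {{variable}} substitution; a real Handlebars engine supports far more (helpers, blocks, escaping), and this sketch chooses to raise on a missing variable so template mistakes surface early.

```python
import re

# Hypothetical judge prompts.
SYSTEM_PROMPT = (
    "You are a strict grader. Decide whether the answer is correct and "
    "well-supported. Reply with PASS or FAIL."
)

USER_TEMPLATE = """Question:
{{question}}

Answer to evaluate:
{{answer}}

Is this answer correct and well-supported?"""

# Matches {{name}} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, context: dict) -> str:
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in context:
            raise KeyError(f"template variable not provided: {name}")
        return str(context[name])

    return _PLACEHOLDER.sub(substitute, template)


prompt = render(
    USER_TEMPLATE,
    {"question": "What is the capital of France?", "answer": "Paris"},
)
```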
Test cases
Create test inputs to validate parsing and template extraction.
Validation
We cannot run a live LLM judge in this notebook. Instead, we validate:
- Malformed JSON is caught and returns -1.0 during training
- Valid JSON is parsed correctly
- Template variables are extracted and injected correctly
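A format check of this kind can be sketched without a judge. The function below is a hypothetical helper, not part of the grader API: it returns a penalty when the output is unusable and None when grading should continue. The training flag and the required-keys default are assumptions made for the example; the -1.0 and 0.0 values follow the penalties described in this cookbook.

```python
import json
from typing import Optional

FORMAT_PENALTY_TRAIN = -1.0  # stronger signal during training
FORMAT_PENALTY_EVAL = 0.0    # plain failure during evaluation


def format_penalty(raw: str, training: bool, required=("answer",)) -> Optional[float]:
    """Return a penalty score if `raw` is unusable, else None."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        payload = None
    # Reject invalid JSON, non-object JSON, and objects missing required keys.
    if not isinstance(payload, dict) or any(key not in payload for key in required):
        return FORMAT_PENALTY_TRAIN if training else FORMAT_PENALTY_EVAL
    return None


malformed = '{"answer": '
valid = '{"answer": "Paris", "confidence": 0.9}'
```

Running this check before the judge call means malformed outputs never consume judge inference.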
Template extraction
Extract template context and verify that variables are available for the judge prompt.
Available template variables
When you override extract_template_context(), you have access to these default variables:
- output_schema: JSON schema of the output model
- turns: full thread of all turns (system, user, assistant)
- metadata: custom metadata from the thread
- context_turns: all turns including context
- context_str: string representation of the full thread
- context_turns_without_last_user: all turns except the last user message
- context_str_without_last_user: string representation without the last user message
- last_user_turn_content: content of the last user turn
- completion: the model's completion
You can add custom variables on top of these, as in AnalysisGrader above.
Using the grader in training
To use AnalysisGrader in a real training recipe:
- The model parameter is required and must be an InferenceModel (from a spawned judge model in your recipe)
- The output_model must be BinaryJudgeOutput for binary grading
- Handlebars templates render variables with {{variable}} syntax
Key takeaways
- Use TemplatedPromptJudgeGrader for semantic evaluation: when you need an LLM judge to evaluate extracted content, templated graders provide flexibility and control.
- Override extract_template_context() to parse and transform: parse your structured JSON, extract fields, and inject them as clean variables. This keeps the judge prompt focused and readable.
- Handlebars templates make prompts reusable: use {{variable}} syntax to create flexible judge prompts that work with any structured schema.
- Different penalties for training vs. eval: use -1.0 for format errors during training (stronger signal) and 0.0 during evaluation.
- Validate early and return fast: check for parsing errors in grade() before calling the judge, so malformed outputs don't waste inference.
- The grader requires an InferenceModel: TemplatedPromptJudgeGrader cannot be instantiated without a real judge model. In recipes, you'll spawn this model and pass it to the grader constructor.

