## Overview
This cookbook demonstrates how to grade models that produce structured JSON output. The key idea: extract specific fields from the JSON and inject them into a custom judge prompt template.
This is useful when:
- Your model produces structured output (JSON with multiple fields)
- You want an LLM judge to evaluate the content of specific fields semantically
- You need to transform or format extracted fields before passing them to the judge
## Example scenario
A model answers a question and produces structured output:
```json
{
  "answer": "The capital of France is Paris.",
  "confidence": "high"
}
```
The grader:
- Parses the JSON and extracts the `answer` field
- Injects it into a judge template along with the original question
- Sends it to an LLM judge: "Is this answer correct and well-supported?"
- Returns the judge's PASS (1.0) or FAIL (0.0) score
## Implementation
Define a simple Pydantic schema:
```python
from typing import Literal

from pydantic import BaseModel, Field

from adaptive_harmony import StringThread, Grade
from adaptive_harmony.core.structured_output import pydantic_parse, OutputParserException, render_schema
from adaptive_harmony.graders.templated_prompt_judge import TemplatedPromptJudgeGrader


class AnalysisOutput(BaseModel):
    """Schema for structured analysis output."""

    answer: str = Field(description="The model's answer to the question")
    confidence: Literal["high", "medium", "low"] = Field(description="Model's confidence in the answer")
```
## How `TemplatedPromptJudgeGrader` works
`TemplatedPromptJudgeGrader` allows you to:
- Parse structured output from a model
- Extract specific fields and transform them
- Inject them into a Handlebars template using `{{variable}}` syntax
- Send the rendered prompt to an LLM judge for semantic evaluation
The key mechanism is overriding `extract_template_context()`. This method:
- Parses your structured JSON
- Builds a dict of variables to inject into the template
- Returns the dict for Handlebars rendering
When grading:
- `extract_template_context()` parses the JSON and builds context variables
- The user template is rendered with those variables using Handlebars (sketched below)
- The rendered prompt + judge system prompt are sent to an LLM judge
- The judge returns a binary decision (PASS/FAIL → 1.0/0.0)
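To make the rendering step concrete, here is a minimal, illustrative stand-in for Handlebars substitution. This is not the library's actual renderer, just a sketch of the idea: each `{{name}}` placeholder is replaced by the matching key from the context dict.

```python
# Illustrative sketch only -- not adaptive_harmony's actual Handlebars renderer.
import re

def render_handlebars(template: str, context: dict) -> str:
    # Replace each {{name}} placeholder with the matching context value.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(context.get(m.group(1), "")), template)

print(render_handlebars(
    "Question:\n{{last_user_turn_content}}\n\nModel answer:\n{{answer}}",
    {"last_user_turn_content": "What is the capital of France?",
     "answer": "The capital of France is Paris."},
))
```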
```python
class AnalysisGrader(TemplatedPromptJudgeGrader):
    """Custom grader for structured analysis output.

    Parses JSON, extracts the answer field, and sends it to an LLM judge
    for semantic evaluation.
    """

    @classmethod
    def extract_template_context(cls, thread: StringThread, output_model=None, template_variables=None):
        """Parse structured JSON and inject extracted fields into the judge prompt.

        Returns a dict with keys that match {{variables}} in your user_template.
        """
        # Get default context from parent (includes thread, last_user_turn_content, etc.)
        default_context = super().extract_template_context(thread, output_model, template_variables)

        # Parse the model's last response as JSON and extract fields
        try:
            structured_output = pydantic_parse(thread.last_content(), AnalysisOutput)
            # Add extracted fields to context for template injection
            default_context["answer"] = structured_output.answer
            default_context["confidence"] = structured_output.confidence
        except OutputParserException:
            # If parsing fails, we'll handle it in the grade() method
            pass

        return default_context

    async def grade(self, sample: StringThread) -> Grade:
        """Grade the sample. If JSON parsing fails, return a penalty score."""
        try:
            # Try to parse to ensure valid JSON
            pydantic_parse(sample.last_content(), AnalysisOutput)
            # If valid, call parent's grade method, which uses the judge
            return await super().grade(sample)
        except OutputParserException:
            # Model output was not valid JSON or didn't match the schema
            is_eval = sample.metadata.get("eval", False)
            score = 0.0 if is_eval else -1.0
            return Grade(
                value=score,
                grader_key=self.grader_key,
                reasoning="Malformed JSON: does not match expected schema",
            )
```
The parent class provides basic context variables (`thread`, `last_content`, `last_user_turn_content`). But with structured output, you need to:
- **Parse the JSON** — Convert the raw string to a Pydantic model
- **Extract fields** — Get specific values (e.g., `answer`, `confidence`)
- **Transform** — Format them for readability (e.g., join lists with newlines)
- **Inject into template** — Provide clean variables for the judge prompt
This separation keeps the judge prompt readable and focused on semantic evaluation, not raw JSON strings.
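For instance, here is a sketch of the transform step, assuming a hypothetical schema with a list-valued field (the `FindingsOutput` name and `findings` field are illustrative, not part of the library):

```python
from pydantic import BaseModel, Field

# Hypothetical schema with a list field, used to illustrate the transform step.
class FindingsOutput(BaseModel):
    findings: list[str] = Field(description="Individual findings")

structured = FindingsOutput(findings=["Paris is the capital.", "It lies on the Seine."])

# Join the list with newlines so the judge sees a readable bullet list,
# not a raw Python list repr.
context_value = "\n".join(f"- {item}" for item in structured.findings)
print(context_value)
```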
## Define judge prompts
The judge needs:
- System prompt — Instructs the judge on evaluation criteria
- User template — Dynamic prompt with `{{variable}}` placeholders filled from your parsed JSON using Handlebars syntax
```python
JUDGE_SYSTEM_PROMPT = """
You are an expert evaluator assessing the quality of model responses.

Evaluate based on these criteria:
1. **Correctness**: Is the answer factually accurate?
2. **Completeness**: Does it fully address the question?
3. **Clarity**: Is it clear and well-articulated?

Respond with PASS if the answer meets all criteria, FAIL otherwise.
"""

JUDGE_USER_TEMPLATE = """
Question:
{{last_user_turn_content}}

Model answer:
{{answer}}

Confidence: {{confidence}}

Is this answer correct, complete, and well-articulated?
"""

print("📋 Judge system prompt:")
print(JUDGE_SYSTEM_PROMPT)
print("📝 Judge user template (Handlebars variables will be filled at grading time):")
print(JUDGE_USER_TEMPLATE)
```
## Test cases
Create test inputs to validate parsing and template extraction:
```python
# Question to be answered
question = "What is the capital of France?"

instructions = f"""Answer the question. Output must be valid JSON:
{render_schema(AnalysisOutput)}
"""

# Good answer
good_answer = AnalysisOutput(
    answer="The capital of France is Paris. It is located in north-central France on the Seine River.",
    confidence="high",
).model_dump_json(indent=2)

# Poor answer (incomplete/vague)
poor_answer = AnalysisOutput(
    answer="It's a city in France.",
    confidence="low",
).model_dump_json(indent=2)

# Malformed output: syntactically valid JSON, but "very-high" is not an
# allowed confidence value, so schema validation fails
malformed_answer = """{
  "answer": "Paris",
  "confidence": "very-high"
}"""

good_thread = StringThread(
    [("system", instructions), ("user", question), ("assistant", good_answer)],
    metadata={"eval": False},
)
poor_thread = StringThread(
    [("system", instructions), ("user", question), ("assistant", poor_answer)],
    metadata={"eval": False},
)
malformed_thread = StringThread(
    [("system", instructions), ("user", question), ("assistant", malformed_answer)],
    metadata={"eval": False},
)

print("🟢 Good thread:")
print(good_thread)
print("🟡 Poor thread:")
print(poor_thread)
print("⚠️ Malformed thread:")
print(malformed_thread)
```
## Validation
We cannot run a live LLM judge in this notebook. Instead, we validate:
- Malformed JSON is caught and returns -1.0 during training
- Valid JSON is parsed correctly
- Template variables are extracted and injected correctly
```python
# Test that malformed JSON is caught during parsing
try:
    pydantic_parse(malformed_thread.last_content(), AnalysisOutput)
    print("🚨 ERROR: Should have failed to parse malformed JSON")
except OutputParserException as e:
    print(f"✅ Correctly caught parse error: {e}\n")

# Test that good and poor answers parse correctly
good_output = pydantic_parse(good_thread.last_content(), AnalysisOutput)
print("✅ Good answer parsed:")
print(f"  📝 answer: {good_output.answer[:60]}...")
print(f"  🔒 confidence: {good_output.confidence}\n")

poor_output = pydantic_parse(poor_thread.last_content(), AnalysisOutput)
print("✅ Poor answer parsed:")
print(f"  📝 answer: {poor_output.answer}")
print(f"  🔒 confidence: {poor_output.confidence}")
```
Extract template context and verify that variables are available for the judge prompt:
```python
# Test template extraction
context = AnalysisGrader.extract_template_context(good_thread, AnalysisOutput)

print("🔍 Extracted template context:")
print(f"  📝 answer: {context['answer']}")
print(f"  🔒 confidence: {context['confidence']}")
print(f"  💬 last_user_turn_content: {context['last_user_turn_content']}")

# Verify that required variables for the template are present
required_vars = ["answer", "confidence", "last_user_turn_content"]
for var in required_vars:
    assert var in context, f"Missing template variable: {var}"
print(f"\n✅ All required template variables extracted: {required_vars}")
```
## Available template variables
When you override `extract_template_context()`, you have access to these default variables:
- `output_schema` — JSON schema of the output model
- `turns` — Full thread of all turns (system, user, assistant)
- `metadata` — Custom metadata from the thread
- `context_turns` — All turns including context
- `context_str` — String representation of the full thread
- `context_turns_without_last_user` — All turns except the last user message
- `context_str_without_last_user` — String representation without the last user message
- `last_user_turn_content` — Content of the last user turn
- `completion` — The model's completion
You can add custom variables by updating the context dict, as shown in the `AnalysisGrader` above.
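As a quick illustration, a subclass could expose a derived variable alongside the parsed fields. This `VerboseAnalysisGrader` and its `word_count` variable are hypothetical, building on the `AnalysisGrader` defined earlier:

```python
# Hypothetical subclass: expose a derived {{word_count}} variable to the template.
class VerboseAnalysisGrader(AnalysisGrader):
    @classmethod
    def extract_template_context(cls, thread, output_model=None, template_variables=None):
        context = super().extract_template_context(thread, output_model, template_variables)
        # "answer" is only present when JSON parsing succeeded above
        if "answer" in context:
            context["word_count"] = len(context["answer"].split())
        return context
```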
## Using the grader in training
To use `AnalysisGrader` in a real training recipe:
```python
from adaptive_harmony.models import InferenceModel
from adaptive_harmony.graders.templated_prompt_judge import BinaryJudgeOutput

# Assume judge_model is a spawned inference model
grader = AnalysisGrader(
    grader_key="analysis-grader",
    model=judge_model,
    system_template=JUDGE_SYSTEM_PROMPT,
    user_template=JUDGE_USER_TEMPLATE,
    output_model=BinaryJudgeOutput,
    temperature=0.0,
)
```
Note:
- The `model` parameter is required and must be an `InferenceModel` (from a spawned judge model in your recipe)
- The `output_model` must be `BinaryJudgeOutput` for binary grading
- Handlebars templates render variables with `{{variable}}` syntax
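Once constructed, grading a thread is a single async call. A hedged sketch, reusing `good_thread` from the test cases above and assuming it runs inside an async context (e.g., a notebook cell or a recipe's async entry point):

```python
# Sketch: grade one of the earlier test threads with the configured grader.
grade = await grader.grade(good_thread)
print(f"Score: {grade.value}")        # 1.0 (PASS) or 0.0 (FAIL) from the judge
print(f"Reasoning: {grade.reasoning}")
```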
## Key takeaways
- **Use `TemplatedPromptJudgeGrader` for semantic evaluation** — When you need an LLM judge to evaluate extracted content, templated graders provide flexibility and control.
- **Override `extract_template_context()` to parse and transform** — Parse your structured JSON, extract fields, and inject them as clean variables. This keeps the judge prompt focused and readable.
- **Handlebars templates make prompts reusable** — Use `{{variable}}` syntax to create flexible judge prompts that work with any structured schema.
- **Different penalties for training vs. eval** — Use `-1.0` for format errors during training (stronger signal) and `0.0` during evaluation.
- **Validate early and return fast** — Check for parsing errors in `grade()` before calling the judge, so malformed outputs don't waste inference.
- **The grader requires an `InferenceModel`** — `TemplatedPromptJudgeGrader` cannot be instantiated without a real judge model. In recipes, you'll spawn this model and pass it to the grader constructor.