

Overview

This cookbook demonstrates how to grade models that produce structured JSON output. The key idea: extract specific fields from the JSON and inject them into a custom judge prompt template. This is useful when:
  • Your model produces structured output (JSON with multiple fields)
  • You want an LLM judge to evaluate the content of specific fields semantically
  • You need to transform or format extracted fields before passing them to the judge

Example scenario

A model answers a question and produces structured output:
{
  "answer": "The capital of France is Paris.",
  "confidence": "high"
}
The grader:
  1. Parses the JSON and extracts the answer field
  2. Injects it into a judge template along with the original question
  3. Sends to an LLM judge: “Is this answer correct and well-supported?”
  4. Returns the judge’s PASS (1.0) or FAIL (0.0) score

Implementation

Define a simple Pydantic schema:
from typing import Literal
from pydantic import BaseModel, Field

from adaptive_harmony import StringThread, Grade
from adaptive_harmony.core.structured_output import pydantic_parse, OutputParserException, render_schema
from adaptive_harmony.graders.templated_prompt_judge import TemplatedPromptJudgeGrader


class AnalysisOutput(BaseModel):
    """Schema for structured analysis output."""

    answer: str = Field(description="The model's answer to the question")
    confidence: Literal["high", "medium", "low"] = Field(description="Model's confidence in the answer")

How TemplatedPromptJudgeGrader works

TemplatedPromptJudgeGrader allows you to:
  1. Parse structured output from a model
  2. Extract specific fields and transform them
  3. Inject them into a Handlebars template using {{variable}} syntax
  4. Send the rendered prompt to an LLM judge for semantic evaluation
The key mechanism is overriding extract_template_context(). This method:
  • Parses your structured JSON
  • Builds a dict of variables to inject into the template
  • Returns the dict for Handlebars rendering
When grading:
  1. extract_template_context() parses the JSON and builds context variables
  2. The user template is rendered with those variables using Handlebars
  3. The rendered prompt + judge system prompt are sent to an LLM judge
  4. The judge returns a binary decision (PASS/FAIL → 1.0/0.0)
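
To make the rendering in step 2 concrete, here is a minimal sketch of what plain {{variable}} substitution does. This is an illustration only, not the grader's internal implementation:
import re

# A context dict like the one extract_template_context() returns
context = {
    "last_user_turn_content": "What is the capital of France?",
    "answer": "Paris",
    "confidence": "high",
}

template = "Question: {{last_user_turn_content}}\nAnswer: {{answer}} (confidence: {{confidence}})"

# Replace each {{name}} placeholder with the matching context value
rendered = re.sub(r"\{\{(\w+)\}\}", lambda m: str(context.get(m.group(1), "")), template)
print(rendered)
# Question: What is the capital of France?
# Answer: Paris (confidence: high)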

Custom grader with extract_template_context override

class AnalysisGrader(TemplatedPromptJudgeGrader):
    """Custom grader for structured analysis output.

    Parses JSON, extracts the answer field, and sends it to an LLM judge
    for semantic evaluation.
    """

    @classmethod
    def extract_template_context(cls, thread: StringThread, output_model=None, template_variables=None):
        """Parse structured JSON and inject extracted fields into the judge prompt.

        Returns a dict with keys that match {{variables}} in your user_template.
        """
        # Get default context from parent (includes thread, last_user_turn_content, etc.)
        default_context = super().extract_template_context(thread, output_model, template_variables)

        # Parse the model's last response as JSON and extract fields
        try:
            structured_output = pydantic_parse(thread.last_content(), AnalysisOutput)
            # Add extracted fields to context for template injection
            default_context["answer"] = structured_output.answer
            default_context["confidence"] = structured_output.confidence
        except OutputParserException:
            # If parsing fails, the grade() method below handles it
            pass

        return default_context

    async def grade(self, sample: StringThread) -> Grade:
        """Grade the sample. If JSON parsing fails, return a penalty score."""
        try:
            # Try to parse to ensure valid JSON
            pydantic_parse(sample.last_content(), AnalysisOutput)
            # If valid, call parent's grade method which uses the judge
            return await super().grade(sample)
        except OutputParserException:
            # Model output was not valid JSON or didn't match the schema
            is_eval = sample.metadata.get("eval", False)
            score = 0.0 if is_eval else -1.0
            return Grade(
                value=score,
                grader_key=self.grader_key,
                reasoning="Malformed JSON: does not match expected schema",
            )

Why override extract_template_context()?

The parent class provides default context variables (such as last_user_turn_content, completion, and context_str; see the full list below). But with structured output, you need to:
  1. Parse the JSON — Convert raw string to Pydantic model
  2. Extract fields — Get specific values (e.g., answer, confidence)
  3. Transform — Format them for readability (e.g., join lists with newlines)
  4. Inject into template — Provide clean variables for the judge prompt
This separation keeps the judge prompt readable and focused on semantic evaluation, not raw JSON strings.
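
For example, the transform step might flatten a list field into a readable string before injection. The citations field below is hypothetical (not part of AnalysisOutput) and shown only to illustrate the idea:
# Hypothetical: suppose the schema also had citations: list[str].
# A raw JSON array is awkward in a judge prompt; a bulleted string is not.
citations = ["Encyclopaedia Britannica", "CIA World Factbook"]
formatted = "\n".join(f"- {c}" for c in citations)
# default_context["citations"] = formatted  # inject like any other variable
print(formatted)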

Define judge prompts

The judge needs:
  • System prompt — Instructs the judge on evaluation criteria
  • User template — Dynamic prompt with {{variable}} placeholders filled from your parsed JSON using Handlebars syntax
JUDGE_SYSTEM_PROMPT = """
You are an expert evaluator assessing the quality of model responses.

Evaluate based on these criteria:
1. **Correctness**: Is the answer factually accurate?
2. **Completeness**: Does it fully address the question?
3. **Clarity**: Is it clear and well-articulated?

Respond with PASS if the answer meets all criteria, FAIL otherwise.
"""

JUDGE_USER_TEMPLATE = """
Question:
{{last_user_turn_content}}

Model answer:
{{answer}}

Confidence: {{confidence}}

Is this answer correct, complete, and well-articulated?
"""

print("📋 Judge system prompt:")
print(JUDGE_SYSTEM_PROMPT)
print("📝 Judge user template (Handlebars variables will be filled at grading time):")
print(JUDGE_USER_TEMPLATE)

Test cases

Create test inputs to validate parsing and template extraction:
# Question to be answered
question = "What is the capital of France?"

instructions = f"""Answer the question. Output must be valid JSON:
{render_schema(AnalysisOutput)}
"""

# Good answer
good_answer = AnalysisOutput(
    answer="The capital of France is Paris. It is located in north-central France on the Seine River.",
    confidence="high"
).model_dump_json(indent=2)

# Poor answer (incomplete/vague)
poor_answer = AnalysisOutput(
    answer="It's a city in France.",
    confidence="low"
).model_dump_json(indent=2)

# Schema-violating output (valid JSON, but "very-high" is not an allowed confidence value)
malformed_answer = """{
  "answer": "Paris",
  "confidence": "very-high"
}"""

good_thread = StringThread(
    [("system", instructions), ("user", question), ("assistant", good_answer)],
    metadata={"eval": False}
)

poor_thread = StringThread(
    [("system", instructions), ("user", question), ("assistant", poor_answer)],
    metadata={"eval": False}
)

malformed_thread = StringThread(
    [("system", instructions), ("user", question), ("assistant", malformed_answer)],
    metadata={"eval": False}
)

print("🟢 Good thread:")
print(good_thread)
print("🟡 Poor thread:")
print(poor_thread)
print("⚠️ Malformed thread:")
print(malformed_thread)

Validation

We cannot run a live LLM judge in this notebook. Instead, we validate:
  1. Malformed or schema-violating output is caught and returns -1.0 during training
  2. Valid JSON is parsed correctly
  3. Template variables are extracted and injected correctly
# Test that malformed JSON is caught during parsing
try:
    pydantic_parse(malformed_thread.last_content(), AnalysisOutput)
    print("🚨 ERROR: Should have failed to parse malformed JSON")
except OutputParserException as e:
    print(f"✅ Correctly caught parse error: {e}\n")

# Test that good and poor answers parse correctly
good_output = pydantic_parse(good_thread.last_content(), AnalysisOutput)
print("✅ Good answer parsed:")
print(f"   📝 answer: {good_output.answer[:60]}...")
print(f"   🔒 confidence: {good_output.confidence}\n")

poor_output = pydantic_parse(poor_thread.last_content(), AnalysisOutput)
print("✅ Poor answer parsed:")
print(f"   📝 answer: {poor_output.answer}")
print(f"   🔒 confidence: {poor_output.confidence}")

Template extraction

Extract template context and verify that variables are available for the judge prompt:
# Test template extraction
context = AnalysisGrader.extract_template_context(good_thread, AnalysisOutput)

print("🔍 Extracted template context:")
print(f"  📝 answer: {context['answer']}")
print(f"  🔒 confidence: {context['confidence']}")
print(f"  💬 last_user_turn_content: {context['last_user_turn_content']}")

# Verify that required variables for the template are present
required_vars = ["answer", "confidence", "last_user_turn_content"]
for var in required_vars:
    assert var in context, f"Missing template variable: {var}"
print(f"\n✅ All required template variables extracted: {required_vars}")

Available template variables

When you override extract_template_context(), you have access to these default variables:
  • output_schema — JSON schema of the output model
  • turns — All turns in the thread (system, user, assistant)
  • metadata — Custom metadata from the thread
  • context_turns — All turns including context
  • context_str — String representation of the full thread
  • context_turns_without_last_user — All turns except the last user message
  • context_str_without_last_user — String representation of the thread without the last user message
  • last_user_turn_content — Content of the last user turn
  • completion — The model’s completion
You can add custom variables by updating the context dict, as shown in the AnalysisGrader above.
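
As a sketch of how far the defaults alone can take you, the following template uses only variables from the list above (the prompt wording is illustrative):
SCHEMA_AWARE_TEMPLATE = """
Question:
{{last_user_turn_content}}

Raw model completion:
{{completion}}

Expected output schema:
{{output_schema}}

Does the completion answer the question and conform to the schema?
"""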

Using the grader in training

To use AnalysisGrader in a real training recipe:
from adaptive_harmony.models import InferenceModel
from adaptive_harmony.graders.templated_prompt_judge import BinaryJudgeOutput

# Assume judge_model is a spawned inference model
grader = AnalysisGrader(
    grader_key="analysis-grader",
    model=judge_model, 
    system_template=JUDGE_SYSTEM_PROMPT,
    user_template=JUDGE_USER_TEMPLATE,
    output_model=BinaryJudgeOutput,
    temperature=0.0,
)
Note:
  • The model parameter is required and must be an InferenceModel (from a spawned judge model in your recipe)
  • The output_model must be BinaryJudgeOutput for binary grading
  • Handlebars templates render variables with {{variable}} syntax
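
Once constructed, grading a thread is a single awaited call. A minimal sketch, reusing good_thread from the test cases above and assuming you are outside an existing event loop:
import asyncio

async def run_grading():
    grade = await grader.grade(good_thread)
    print(f"score={grade.value}, reasoning={grade.reasoning}")

asyncio.run(run_grading())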

Key takeaways

  1. Use TemplatedPromptJudgeGrader for semantic evaluation — When you need an LLM judge to evaluate extracted content, templated graders provide flexibility and control.
  2. Override extract_template_context() to parse and transform — Parse your structured JSON, extract fields, and inject them as clean variables. This keeps the judge prompt focused and readable.
  3. Handlebars templates make prompts reusable — Use {{variable}} syntax to create flexible judge prompts that work with any structured schema.
  4. Different penalties for training vs. eval — Use -1.0 for format errors during training (stronger signal) and 0.0 during evaluation.
  5. Validate early and return fast — Check for parsing errors in grade() before calling the judge, so malformed outputs don’t waste inference.
  6. The grader requires an InferenceModel — TemplatedPromptJudgeGrader cannot be instantiated without a real judge model. In recipes, you’ll spawn this model and pass it to the grader constructor.