

Overview

This cookbook demonstrates how to grade models that produce structured JSON output. The key idea: extract specific fields from the JSON and inject them into a custom judge prompt template. This is useful when:
  • Your model produces structured output (JSON with multiple fields)
  • You want an LLM judge to evaluate the content of specific fields semantically
  • You need to transform or format extracted fields before passing them to the judge

Example scenario

A model answers a question and produces structured output:
{
  "answer": "The capital of France is Paris.",
  "confidence": "high"
}
The grader:
  1. Parses the JSON and extracts the answer field
  2. Injects it into a judge template along with the original question
  3. Sends to an LLM judge: “Is this answer correct and well-supported?”
  4. Returns the judge’s PASS (1.0) or FAIL (0.0) score

Implementation

Define a simple Pydantic schema:
from typing import Literal
from pydantic import BaseModel, Field

from adaptive_harmony import StringThread, Grade
from adaptive_harmony.core.structured_output import pydantic_parse, OutputParserException, render_schema
from adaptive_harmony.graders.templated_prompt_judge import TemplatedPromptJudgeGrader


class AnalysisOutput(BaseModel):
    """Schema for structured analysis output."""

    answer: str = Field(description="The model's answer to the question")
    confidence: Literal["high", "medium", "low"] = Field(description="Model's confidence in the answer")

How TemplatedPromptJudgeGrader works

TemplatedPromptJudgeGrader allows you to:
  1. Parse structured output from a model
  2. Extract specific fields and transform them
  3. Inject them into a Handlebars template using {{variable}} syntax
  4. Send the rendered prompt to an LLM judge for semantic evaluation
The key mechanism is overriding extract_template_context(). This method:
  • Parses your structured JSON
  • Builds a dict of variables to inject into the template
  • Returns the dict for Handlebars rendering
When grading:
  1. extract_template_context() parses the JSON and builds context variables
  2. The user template is rendered with those variables using Handlebars
  3. The rendered prompt + judge system prompt are sent to an LLM judge
  4. The judge returns a binary decision (PASS/FAIL → 1.0/0.0)
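
To make the rendering in step 2 concrete, here is a minimal sketch of what plain {{variable}} substitution does. This is an illustration only, not the grader's internal implementation:
import re

# A context dict like the one extract_template_context() returns
context = {
    "last_user_turn_content": "What is the capital of France?",
    "answer": "Paris",
    "confidence": "high",
}

template = "Question: {{last_user_turn_content}}\nAnswer: {{answer}} (confidence: {{confidence}})"

# Replace each {{name}} placeholder with the matching context value
rendered = re.sub(r"\{\{(\w+)\}\}", lambda m: str(context.get(m.group(1), "")), template)
print(rendered)
# Question: What is the capital of France?
# Answer: Paris (confidence: high)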

Custom grader with extract_template_context override

class AnalysisGrader(TemplatedPromptJudgeGrader):
    """Custom grader for structured analysis output.

    Parses JSON, extracts the answer field, and sends it to an LLM judge
    for semantic evaluation.
    """

    @classmethod
    def extract_template_context(cls, thread: StringThread, output_model=None, template_variables=None):
        """Parse structured JSON and inject extracted fields into the judge prompt.

        Returns a dict with keys that match {{variables}} in your user_template.
        """
        # Get default context from parent (includes thread, last_user_turn_content, etc.)
        default_context = super().extract_template_context(thread, output_model, template_variables)

        # Parse the model's last response as JSON and extract fields
        try:
            structured_output = pydantic_parse(thread.last_content(), AnalysisOutput)
            # Add extracted fields to context for template injection
            default_context["answer"] = structured_output.answer
            default_context["confidence"] = structured_output.confidence
        except OutputParserException:
            # If parsing fails, the grade() method below handles it
            pass

        return default_context

    async def grade(self, sample: StringThread) -> Grade:
        """Grade the sample. If JSON parsing fails, return a penalty score."""
        try:
            # Try to parse to ensure valid JSON
            pydantic_parse(sample.last_content(), AnalysisOutput)
            # If valid, call parent's grade method which uses the judge
            return await super().grade(sample)
        except OutputParserException:
            # Model output was not valid JSON or didn't match the schema
            is_eval = sample.metadata.get("eval", False)
            score = 0.0 if is_eval else -1.0
            return Grade(
                value=score,
                grader_key=self.grader_key,
                reasoning="Malformed JSON: does not match expected schema",
            )

Why override extract_template_context()?

The parent class provides default context variables (such as last_user_turn_content, completion, and context_str; see the full list below). But with structured output, you need to:
  1. Parse the JSON — Convert raw string to Pydantic model
  2. Extract fields — Get specific values (e.g., answer, confidence)
  3. Transform — Format them for readability (e.g., join lists with newlines)
  4. Inject into template — Provide clean variables for the judge prompt
This separation keeps the judge prompt readable and focused on semantic evaluation, not raw JSON strings.
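
For example, the transform step might flatten a list field into a readable string before injection. The citations field below is hypothetical (not part of AnalysisOutput) and shown only to illustrate the idea:
# Hypothetical: suppose the schema also had citations: list[str].
# A raw JSON array is awkward in a judge prompt; a bulleted string is not.
citations = ["Encyclopaedia Britannica", "CIA World Factbook"]
formatted = "\n".join(f"- {c}" for c in citations)
# default_context["citations"] = formatted  # inject like any other variable
print(formatted)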

Define judge prompts

The judge needs:
  • System prompt — Instructs the judge on evaluation criteria
  • User template — Dynamic prompt with {{variable}} placeholders filled from your parsed JSON using Handlebars syntax
JUDGE_SYSTEM_PROMPT = """
You are an expert evaluator assessing the quality of model responses.

Evaluate based on these criteria:
1. **Correctness**: Is the answer factually accurate?
2. **Completeness**: Does it fully address the question?
3. **Clarity**: Is it clear and well-articulated?

Respond with PASS if the answer meets all criteria, FAIL otherwise.
"""

JUDGE_USER_TEMPLATE = """
Question:
{{last_user_turn_content}}

Model answer:
{{answer}}

Confidence: {{confidence}}

Is this answer correct, complete, and well-articulated?
"""

print("📋 Judge system prompt:")
print(JUDGE_SYSTEM_PROMPT)
print("📝 Judge user template (Handlebars variables will be filled at grading time):")
print(JUDGE_USER_TEMPLATE)

Test cases

Create test inputs to validate parsing and template extraction:
# Question to be answered
question = "What is the capital of France?"

instructions = f"""Answer the question. Output must be valid JSON:
{render_schema(AnalysisOutput)}
"""

# Good answer
good_answer = AnalysisOutput(
    answer="The capital of France is Paris. It is located in north-central France on the Seine River.",
    confidence="high"
).model_dump_json(indent=2)

# Poor answer (incomplete/vague)
poor_answer = AnalysisOutput(
    answer="It's a city in France.",
    confidence="low"
).model_dump_json(indent=2)

# Schema-violating output (valid JSON, but "very-high" is not an allowed confidence value)
malformed_answer = """{
  "answer": "Paris",
  "confidence": "very-high"
}"""

good_thread = StringThread(
    [("system", instructions), ("user", question), ("assistant", good_answer)],
    metadata={"eval": False}
)

poor_thread = StringThread(
    [("system", instructions), ("user", question), ("assistant", poor_answer)],
    metadata={"eval": False}
)

malformed_thread = StringThread(
    [("system", instructions), ("user", question), ("assistant", malformed_answer)],
    metadata={"eval": False}
)

print("🟢 Good thread:")
print(good_thread)
print("🟡 Poor thread:")
print(poor_thread)
print("⚠️ Malformed thread:")
print(malformed_thread)

Validation

We cannot run a live LLM judge in this notebook. Instead, we validate:
  1. Malformed or schema-violating output is caught and returns -1.0 during training
  2. Valid JSON is parsed correctly
  3. Template variables are extracted and injected correctly
# Test that malformed JSON is caught during parsing
try:
    pydantic_parse(malformed_thread.last_content(), AnalysisOutput)
    print("🚨 ERROR: Should have failed to parse malformed JSON")
except OutputParserException as e:
    print(f"✅ Correctly caught parse error: {e}\n")

# Test that good and poor answers parse correctly
good_output = pydantic_parse(good_thread.last_content(), AnalysisOutput)
print("✅ Good answer parsed:")
print(f"   📝 answer: {good_output.answer[:60]}...")
print(f"   🔒 confidence: {good_output.confidence}\n")

poor_output = pydantic_parse(poor_thread.last_content(), AnalysisOutput)
print("✅ Poor answer parsed:")
print(f"   📝 answer: {poor_output.answer}")
print(f"   🔒 confidence: {poor_output.confidence}")

Template extraction

Extract template context and verify that variables are available for the judge prompt:
# Test template extraction
context = AnalysisGrader.extract_template_context(good_thread, AnalysisOutput)

print("🔍 Extracted template context:")
print(f"  📝 answer: {context['answer']}")
print(f"  🔒 confidence: {context['confidence']}")
print(f"  💬 last_user_turn_content: {context['last_user_turn_content']}")

# Verify that required variables for the template are present
required_vars = ["answer", "confidence", "last_user_turn_content"]
for var in required_vars:
    assert var in context, f"Missing template variable: {var}"
print(f"\n✅ All required template variables extracted: {required_vars}")

Available template variables

When you override extract_template_context(), you have access to these default variables:
  • output_schema — JSON schema of the output model
  • turns — All turns in the thread (system, user, assistant)
  • metadata — Custom metadata from the thread
  • context_turns — All turns including context
  • context_str — String representation of the full thread
  • context_turns_without_last_user — All turns except the last user message
  • context_str_without_last_user — String representation of the thread without the last user message
  • last_user_turn_content — Content of the last user turn
  • completion — The model’s completion
You can add custom variables by updating the context dict, as shown in the AnalysisGrader above.
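
As a sketch of how far the defaults alone can take you, the following template uses only variables from the list above (the prompt wording is illustrative):
SCHEMA_AWARE_TEMPLATE = """
Question:
{{last_user_turn_content}}

Raw model completion:
{{completion}}

Expected output schema:
{{output_schema}}

Does the completion answer the question and conform to the schema?
"""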

Using the grader in training

To use AnalysisGrader in a real training recipe:
from adaptive_harmony.models import InferenceModel
from adaptive_harmony.graders.templated_prompt_judge import BinaryJudgeOutput

# Assume judge_model is a spawned inference model
grader = AnalysisGrader(
    grader_key="analysis-grader",
    model=judge_model, 
    system_template=JUDGE_SYSTEM_PROMPT,
    user_template=JUDGE_USER_TEMPLATE,
    output_model=BinaryJudgeOutput,
    temperature=0.0,
)
Note:
  • The model parameter is required and must be an InferenceModel (from a spawned judge model in your recipe)
  • The output_model must be BinaryJudgeOutput for binary grading
  • Handlebars templates render variables with {{variable}} syntax
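
Once constructed, grading a thread is a single awaited call. A minimal sketch, reusing good_thread from the test cases above and assuming you are outside an existing event loop:
import asyncio

async def run_grading():
    grade = await grader.grade(good_thread)
    print(f"score={grade.value}, reasoning={grade.reasoning}")

asyncio.run(run_grading())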

Key takeaways

  1. Use TemplatedPromptJudgeGrader for semantic evaluation — When you need an LLM judge to evaluate extracted content, templated graders provide flexibility and control.
  2. Override extract_template_context() to parse and transform — Parse your structured JSON, extract fields, and inject them as clean variables. This keeps the judge prompt focused and readable.
  3. Handlebars templates make prompts reusable — Use {{variable}} syntax to create flexible judge prompts that work with any structured schema.
  4. Different penalties for training vs. eval — Use -1.0 for format errors during training (stronger signal) and 0.0 during evaluation.
  5. Validate early and return fast — Check for parsing errors in grade() before calling the judge, so malformed outputs don’t waste inference.
  6. The grader requires an InferenceModel — TemplatedPromptJudgeGrader cannot be instantiated without a real judge model. In recipes, you’ll spawn this model and pass it to the grader constructor.