# Inference

> Make requests to models deployed on Adaptive Engine

Send chat completion requests to deployed models using the Adaptive SDK, the OpenAI Python library, or any HTTP client.

If you omit `model`, requests route to the project's default model or, when an A/B test is active, to one of the models in that test.

Interactions (prompt + completion pairs) are logged automatically. See [Interactions](/v0.14/core/interactions) for details.

<Tabs>
  <Tab title="SDK" icon="code">
    ## Chat completions

    ```python theme={null}
    # `adaptive` is an already-initialized Adaptive SDK client.
    response = adaptive.chat.create(
        model="llama-3.1-8b-instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"}
        ],
        labels={"project": "support-bot"},
    )
    print(response.choices[0].message.content)
    ```

    | Parameter     | Type        | Description                                    |
    | ------------- | ----------- | ---------------------------------------------- |
    | `model`       | str         | Model key. Omit to use the project default.    |
    | `messages`    | list        | Chat messages with `role` and `content`        |
    | `labels`      | dict        | Key-value pairs for filtering interactions     |
    | `stream`      | bool        | Enable streaming (default: False)              |
    | `temperature` | float       | Sampling temperature                           |
    | `max_tokens`  | int         | Maximum tokens to generate                     |
    | `stop`        | list        | Stop sequences                                 |
    | `top_p`       | float       | Top-p sampling threshold                       |
    | `session_id`  | str or UUID | Session ID for KV-cache reuse across turns     |
    | `store`       | bool        | Whether to log the interaction (default: True) |
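
    To illustrate `session_id`, the sketch below keeps one ID for a whole conversation and resends the growing message history on each turn. `run_conversation` is a hypothetical helper, not part of the SDK. It takes the create function (for example `adaptive.chat.create`) as an argument and assumes the response shape shown above (`response.choices[0].message.content`). Resending the full history on every turn is also an assumption about how the cache is meant to be used.

    ```python
    import uuid
    from typing import Any, Callable


    def run_conversation(
        create: Callable[..., Any], user_turns: list[str], **params: Any
    ) -> list[str]:
        """Send each user turn with the same session_id; return the assistant replies."""
        session_id = uuid.uuid4()  # one ID shared by every turn of this conversation
        history: list[dict[str, str]] = []
        replies: list[str] = []
        for text in user_turns:
            history.append({"role": "user", "content": text})
            # Pass a copy so later appends do not mutate what the callee received.
            response = create(messages=list(history), session_id=session_id, **params)
            reply = response.choices[0].message.content
            history.append({"role": "assistant", "content": reply})
            replies.append(reply)
        return replies
    ```

    A call might look like `run_conversation(adaptive.chat.create, ["Hi", "Tell me more"], max_tokens=256)`; any extra keyword arguments are forwarded unchanged.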

    ## Streaming

    ```python theme={null}
    stream = adaptive.chat.create(
        messages=[{"role": "user", "content": "Hello!"}],
        stream=True,
    )
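    for chunk in stream:
        # Skip chunks with no choices or an empty delta so "None" is never printed.
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)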
    ```

    ## Get the completion ID

    Use `completion_id` to log [Metrics](/v0.14/core/metrics) against the response:

    ```python theme={null}
    completion_id = response.choices[0].completion_id
    ```

    ## Vision requests

    Models with the **Multimodal** tag accept images alongside text. Images must be base64-encoded data URIs (JPEG, PNG, WebP, or GIF, up to 10 MB each).

    ```python theme={null}
    import base64

    with open("photo.png", "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    response = adaptive.chat.create(
        model="your-vlm-key",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What's in this image?"},
                    {"type": "image_url", "image_url": f"data:image/png;base64,{image_data}"},
                ],
            }
        ],
    )
    ```
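
    The format and size limits above can be checked locally before a request is sent. The following is a minimal sketch: `image_part` is a hypothetical helper, not part of the SDK, and it assumes that the 10 MB limit applies to the raw file bytes (not the base64 string) and that 10 MB means 10 × 1024 × 1024 bytes.

    ```python
    import base64
    from pathlib import Path

    # File suffix -> MIME type for the four formats listed above.
    MIME_BY_SUFFIX = {
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".png": "image/png",
        ".webp": "image/webp",
        ".gif": "image/gif",
    }
    MAX_IMAGE_BYTES = 10 * 1024 * 1024  # assumption: limit counts raw bytes


    def image_part(path: str) -> dict:
        """Build a flat-string image content part from a local file."""
        suffix = Path(path).suffix.lower()
        mime = MIME_BY_SUFFIX.get(suffix)
        if mime is None:
            raise ValueError(f"unsupported image type: {suffix!r}")
        raw = Path(path).read_bytes()
        if len(raw) > MAX_IMAGE_BYTES:
            raise ValueError(f"image is {len(raw)} bytes; limit is {MAX_IMAGE_BYTES}")
        encoded = base64.b64encode(raw).decode("ascii")
        return {"type": "image_url", "image_url": f"data:{mime};base64,{encoded}"}
    ```

    The returned dict can be placed in a message's `content` list next to a `{"type": "text", ...}` part, as in the example above.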

    ## Structured output

    Pass `response_format` to constrain a completion to a JSON Schema or a Pydantic model. For internal models, invalid tokens are masked at each generation step, so a completion that finishes normally is structurally guaranteed to parse (the exceptions are listed under Failure modes below). For external providers, the schema is forwarded to the provider's native structured-output API.

    ```python theme={null}
    from pydantic import BaseModel

    class Classification(BaseModel):
        label: str
        confidence: float

    response = adaptive.chat.create(
        model="llama-3.1-8b-instruct",
        messages=[{"role": "user", "content": "Classify: 'great product, fast shipping'"}],
        response_format=Classification,
    )
    parsed = Classification.model_validate_json(response.choices[0].message.content)
    ```

    `response_format` accepts a Pydantic `BaseModel` class, a raw JSON Schema envelope (`{"type": "json_schema", "json_schema": {"name": ..., "schema": ...}}`), or `None` (default). Pydantic models are auto-converted via `model_json_schema()` and patched for strict-mode compatibility (refs inlined, `additionalProperties: false` added).
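
    Where Pydantic is not in use, the raw envelope can be written by hand. The sketch below builds an envelope with the same intent as the `Classification` model above. The `name` value and the exact schema the SDK would generate from that model are assumptions, and `example_content` is a stand-in for a real `message.content` string.

    ```python
    import json

    # Hand-written JSON Schema mirroring the two fields of Classification.
    classification_schema = {
        "type": "object",
        "properties": {
            "label": {"type": "string"},
            "confidence": {"type": "number"},
        },
        "required": ["label", "confidence"],
        "additionalProperties": False,
    }

    # Envelope in the shape described above.
    response_format = {
        "type": "json_schema",
        "json_schema": {"name": "Classification", "schema": classification_schema},
    }

    # With a raw schema there is no model class, so parse the content with json.loads.
    example_content = '{"label": "positive", "confidence": 0.97}'
    parsed = json.loads(example_content)
    print(parsed["label"])
    ```

    The `response_format` dict would then be passed as `response_format=response_format` in place of the Pydantic class.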

    The response is a JSON string in `response.choices[0].message.content`; the SDK does not deserialize it automatically. Call `model_validate_json(...)` on your Pydantic class (for example, `Classification.model_validate_json(...)`) to get a typed instance.

    <Accordion title="Schema features supported">
      Constrained decoding compiles the schema to a token-mask grammar. The compiler supports:

      * **Types:** `string`, `integer`, `number`, `boolean`, `null`, `object`, `array`, and union types via `["string", "null"]` syntax
      * **Composition:** `oneOf`, `anyOf`, `allOf` (object merge only), `$ref` and `$defs` (inlined during SDK prep)
      * **Strings:** `minLength`, `maxLength`, `pattern` (regex), `format`, `enum`, `const`
      * **Numbers:** `minimum`, `maximum`, `exclusiveMinimum`, `exclusiveMaximum`. `multipleOf` requires explicit bounds — without bounds, it's silently ignored.
      * **Arrays:** `items`, `minItems`, `maxItems`. Arrays must declare `items`.
      * **Objects:** `properties`, `required`, `additionalProperties` (false / true / schema)

      Recursive schemas with cyclic `$ref` are unrolled to depth 4; deeper nesting is truncated. Format keywords without a regex equivalent are dropped.
    </Accordion>
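
    Two of the rules above are easy to miss: arrays must declare `items`, and `multipleOf` is ignored without bounds. A local pre-flight check can catch both. The sketch below is a hypothetical lint, not the server's compiler. It only walks `properties` and `items`, it does not follow `$ref`, and it assumes that "explicit bounds" means both a lower and an upper bound.

    ```python
    def lint_schema(schema: dict, path: str = "$") -> list[str]:
        """Return warnings for two documented pitfalls in a JSON Schema."""
        warnings: list[str] = []
        types = schema.get("type")
        types = types if isinstance(types, list) else [types]  # handle union types

        if "array" in types and "items" not in schema:
            warnings.append(f"{path}: array without 'items'")

        if "multipleOf" in schema:
            has_lower = "minimum" in schema or "exclusiveMinimum" in schema
            has_upper = "maximum" in schema or "exclusiveMaximum" in schema
            if not (has_lower and has_upper):
                warnings.append(f"{path}: 'multipleOf' without bounds is ignored")

        # Recurse into nested object properties and array item schemas.
        for name, sub in schema.get("properties", {}).items():
            warnings.extend(lint_schema(sub, f"{path}.{name}"))
        if isinstance(schema.get("items"), dict):
            warnings.extend(lint_schema(schema["items"], f"{path}[]"))
        return warnings
    ```

    Running it in a unit test during development surfaces these cases before a schema reaches the server, where an ignored keyword would otherwise go unnoticed.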

    ### Streaming with structured output

    `stream=True` and `response_format` work together. Each chunk delivers partial JSON; buffer until the stream closes, then parse:

    ```python theme={null}
    buffer = ""
    stream = adaptive.chat.create(
        messages=[{"role": "user", "content": "Classify: 'late and damaged'"}],
        response_format=Classification,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            buffer += chunk.choices[0].delta.content
    parsed = Classification.model_validate_json(buffer)
    ```

    ### Failure modes

    | Situation                                                         | Behavior                                                                                                                                |
    | ----------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
    | Schema fails to compile                                           | The constraint is dropped with a warning in server logs; the model generates unconstrained. Validate your schema during development.    |
    | `max_tokens` exhausted before completion                          | `finish_reason` is `"length"`. The response is truncated JSON — `model_validate_json` will raise. Check `finish_reason` before parsing. |
    | External provider doesn't support structured output for the model | The constraint is dropped silently. Stick to providers and models that support structured output natively.                              |
    | Recursive schema beyond depth 4                                   | Deeper levels are truncated at compile time.                                                                                            |
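
    The truncation case in the table can be guarded with a small wrapper. This is a sketch with a hypothetical helper, `parse_completion`. It assumes `finish_reason` is available on the choice object (for example `response.choices[0].finish_reason`), which this page does not state explicitly.

    ```python
    import json
    from typing import Any


    class TruncatedCompletionError(ValueError):
        """Raised when generation stopped at max_tokens before the JSON was complete."""


    def parse_completion(content: str, finish_reason: str) -> Any:
        """Parse structured output, refusing completions cut off by max_tokens."""
        if finish_reason == "length":
            raise TruncatedCompletionError(
                "completion hit max_tokens; raise max_tokens or simplify the schema"
            )
        # json.loads still raises on malformed input, which covers the cases
        # where the schema constraint was dropped and the model ran unconstrained.
        return json.loads(content)
    ```

    With a response in hand, usage would be along the lines of `choice = response.choices[0]` followed by `parse_completion(choice.message.content, choice.finish_reason)`.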

    See [SDK Reference](/v0.14/reference/sdk) for all chat methods.
  </Tab>

  <Tab title="UI" icon="mouse-pointer">
    ## Chat playground

    Open your project and click **Chat** to interact with deployed models in the browser.

    When a vision model is deployed, the Chat view supports image uploads. Attach images directly in the conversation to send multimodal requests.

    ## Structured output

    Open **Model Settings** and set a **Response Format** to constrain a model to a JSON Schema. The workbench includes presets for classification, entity extraction, and simple object output. Judge graders use the same field to guarantee parseable judgement output.
  </Tab>
</Tabs>

## OpenAI compatibility

Use the OpenAI Python library with your Adaptive deployment:

```python theme={null}
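import os

from openai import OpenAI

# Read connection settings from the environment (same variables as the curl example).
ADAPTIVE_URL = os.environ["ADAPTIVE_URL"]
ADAPTIVE_API_KEY = os.environ["ADAPTIVE_API_KEY"]

client = OpenAI(
    base_url=f"{ADAPTIVE_URL}/api/v1",
    api_key=ADAPTIVE_API_KEY,
)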

response = client.chat.completions.create(
    model="project_key/model_key",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

Set `model` to `project_key/model_key`. Use `metadata` instead of `labels`.
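
Both differences can be captured in a small translation step. The following sketch uses a hypothetical helper, `to_openai_kwargs`, that turns SDK-style arguments into keyword arguments for `client.chat.completions.create`. The helper name and its signature are illustrative only.

```python
from typing import Any, Optional


def to_openai_kwargs(
    project_key: str,
    model_key: str,
    messages: list[dict],
    labels: Optional[dict[str, str]] = None,
    **params: Any,
) -> dict[str, Any]:
    """Translate SDK-style arguments into OpenAI-client keyword arguments."""
    kwargs: dict[str, Any] = {
        "model": f"{project_key}/{model_key}",  # combined model identifier
        "messages": messages,
        **params,  # e.g. temperature, max_tokens
    }
    if labels:
        kwargs["metadata"] = dict(labels)  # labels travel as `metadata`
    return kwargs
```

The result can be unpacked into the call, for example `client.chat.completions.create(**to_openai_kwargs("project_key", "model_key", messages, labels={"project": "support-bot"}))`.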

<Accordion title="Image format difference">
  Multimodal image format differs between Adaptive and OpenAI:

  ```python theme={null}
  # Adaptive format (flat string)
  {"type": "image_url", "image_url": "data:image/png;base64,..."}

  # OpenAI format (nested object)
  {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
  ```
</Accordion>
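
When message payloads move between the two shapes, the image parts need converting. Below is a minimal sketch with a hypothetical helper, `to_openai_content`, which rewrites flat-string image parts into the nested form and leaves every other part untouched. It is purely a shape conversion and makes no claim about which form a given endpoint expects.

```python
def to_openai_content(parts: list[dict]) -> list[dict]:
    """Convert flat-string image parts to the nested {"url": ...} form."""
    converted: list[dict] = []
    for part in parts:
        image = part.get("image_url")
        if part.get("type") == "image_url" and isinstance(image, str):
            # Flat string -> nested object with a "url" key.
            converted.append({"type": "image_url", "image_url": {"url": image}})
        else:
            # Text parts and already-nested image parts pass through unchanged.
            converted.append(part)
    return converted
```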

## HTTP requests

Use any HTTP client to call the chat completions endpoint directly.

<Tabs>
  <Tab title="requests">
    ```python theme={null}
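    import os

    import requests

    # Read connection settings from the environment.
    ADAPTIVE_URL = os.environ["ADAPTIVE_URL"]
    ADAPTIVE_API_KEY = os.environ["ADAPTIVE_API_KEY"]

    headers = {"Authorization": f"Bearer {ADAPTIVE_API_KEY}"}
    payload = {
        "model": "project_key/model_key",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"}
        ],
        "labels": {"project": "support-bot"},
    }

    response = requests.post(
        url=f"{ADAPTIVE_URL}/api/v1/chat/completions",
        json=payload,
        headers=headers,
        timeout=60,
    )
    response.raise_for_status()  # surface HTTP errors before reading the body
    completion_text = response.json()["choices"][0]["message"]["content"]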
    ```
  </Tab>

  <Tab title="curl">
    ```bash theme={null}
    curl "$ADAPTIVE_URL/api/v1/chat/completions" \
      -X POST \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $ADAPTIVE_API_KEY" \
      -d '{
        "model": "project_key/model_key",
        "messages": [{"role": "user", "content": "Hello!"}],
        "labels": {"project": "support-bot"}
      }'
    ```
  </Tab>
</Tabs>

See [API Reference](/v0.14/reference/api) for the full endpoint specification.
