# Multimodal Insights

> Extracts insights and structured information from text and images using AI vision models.

## Multimodal Insights Engine Overview

The Multimodal Insights Engine extracts structured information from text, images, or both. It replaces the separate Text Insights and Image Insights engines with a single interface for multimodal content analysis.

## Engine Inputs

The Multimodal Insights Engine accepts the following configuration parameters:

* **text**: *optional.* Text content to analyze and extract information from. Leave blank to extract from images only.
* **images**: *optional.* Image URLs or file IDs, separated by commas, newlines, or whitespace. Leave blank to extract from text only.
* **instruction**: *optional.* Instructions describing what to extract from the content. Defaults to empty.
* **model**: *optional.* The AI model to use (default: `gpt-4.1-2025-04-14`).
* **reasoning**: *optional.* Reasoning effort level for the model. Options: `low`, `medium`, `high`.
* **output\_schema**: *required.* JSON schema defining the structure of data to extract. Follows the standard [JSON schema specification](https://json-schema.org/).

<Note>At least one of **text** or **images** must be provided.</Note>

See [Template Strings](/agents/input-definition#template-strings) for dynamic parameter configuration.
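
The input rules above can be checked on the client before a job is submitted. The following is a minimal sketch, not part of the Roe AI API: `split_image_refs` and `check_inputs` are hypothetical helper names, and the engine's own parsing of the **images** field may differ in edge cases.

```python
import re


def split_image_refs(images: str) -> list[str]:
    """Split an images string on commas, newlines, or other whitespace."""
    # Assumption: empty fragments (for example from ", ") are ignored.
    return [ref for ref in re.split(r"[,\s]+", images) if ref]


def check_inputs(text: str = "", images: str = "") -> None:
    """Raise ValueError unless at least one of text or images is provided."""
    if not text.strip() and not split_image_refs(images):
        raise ValueError("Provide at least one of 'text' or 'images'.")
```

Calling `check_inputs(text="some text")` or `check_inputs(images="https://example.com/a.png")` passes silently, while `check_inputs()` raises `ValueError`.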

## Engine Output

The output is a JSON value that matches the structure defined in **output\_schema**.
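
When consuming results downstream, it can help to confirm that an output has the expected shape. The sketch below is a deliberately small check written with the Python standard library only. It covers just the top-level `required` list and the `string`, `array`, and `object` types, which is far less than the full JSON Schema specification; `find_problems` is a hypothetical helper name, and a complete validator library would be the better choice for production use.

```python
# Maps a subset of JSON Schema type names to Python types.
_JSON_TYPES = {"string": str, "array": list, "object": dict}


def find_problems(output: dict, schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    # Every field listed under "required" must be present.
    for key in schema.get("required", []):
        if key not in output:
            problems.append(f"missing required field: {key}")
    # Fields that are present must have the declared top-level type.
    for key, spec in schema.get("properties", {}).items():
        expected = _JSON_TYPES.get(spec.get("type"))
        if key in output and expected and not isinstance(output[key], expected):
            problems.append(f"wrong type for field: {key}")
    return problems
```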

## Text Extraction Example

Extract structured information from text content:

<Steps>
  <Step title="Create an Agent">
    Click the **Add Agent** button in the top-right corner of the Agents page.

    <Frame>
      <img src="https://mintcdn.com/roeai/qeWYCF2quzHQHhsD/images/add-agent.png?fit=max&auto=format&n=qeWYCF2quzHQHhsD&q=85&s=b3e1ec9b816ed1e57cb1ecfa53ff4288" width="1920" height="1045" data-path="images/add-agent.png" />
    </Frame>

    Enter a name and an optional description of your Agent.
  </Step>

  <Step title="Select the Multimodal Insights Engine" />

  <Step title="Configure the engine for text extraction">
    <Info>\$ starts a template string</Info>

    * **text**: \$text
    * **images**: (leave empty)
    * **instruction**: Analyze the text to extract key information and insights
    * **model**: gpt-4.1-2025-04-14
    * **output\_schema**: Copy and paste the JSON schema below:

    ```json theme={null}
    {
      "type": "object",
      "properties": {
        "title": {
          "type": "string",
          "description": "Main title or heading"
        },
        "summary": {
          "type": "string",
          "description": "Brief summary of the content"
        },
        "key_points": {
          "type": "array",
          "items": { "type": "string" },
          "description": "List of key points extracted"
        }
      },
      "required": ["title", "summary"]
    }
    ```
  </Step>

  <Step title="Create the Agent">
    Click the **Create** button.
  </Step>

  <Step title="Run a job with text input">
    Create a new job and provide the text content to analyze. The engine returns structured information that matches your output schema.
  </Step>
</Steps>
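
The `$text` value in the configuration above is a placeholder that is filled in when a job runs. As a rough analogy only, Python's standard `string.Template` uses the same `$name` placeholder style. This sketch illustrates the idea and does not reproduce Roe AI's substitution logic; `job_inputs` and the sample sentence are made up for illustration.

```python
from string import Template

# Configuration values as entered in the steps above; "$text" is a placeholder.
config = {
    "text": "$text",
    "images": "",
    "instruction": "Analyze the text to extract key information and insights",
}

# Hypothetical job input supplied at run time.
job_inputs = {"text": "Quarterly revenue grew 12% year over year."}

# Replace each placeholder with the matching job input; other values pass through unchanged.
resolved = {
    key: Template(value).safe_substitute(job_inputs)
    for key, value in config.items()
}

print(resolved["text"])
```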

## Image Extraction Example

Extract structured information from images:

<Steps>
  <Step title="Create an Agent">
    Click the **Add Agent** button and enter a name for your Agent.
  </Step>

  <Step title="Select the Multimodal Insights Engine" />

  <Step title="Configure the engine for image extraction">
    <Info>\$ starts a template string</Info>

    * **text**: (leave empty)
    * **images**: \$images
    * **instruction**: Analyze the image and describe what you see in detail
    * **model**: gpt-4.1-2025-04-14
    * **output\_schema**: Copy and paste the JSON schema below:

    ```json theme={null}
    {
      "type": "object",
      "properties": {
        "description": {
          "type": "string",
          "description": "Detailed description of the image content"
        },
        "objects": {
          "type": "array",
          "items": { "type": "string" },
          "description": "List of objects identified in the image"
        },
        "text_content": {
          "type": "string",
          "description": "Any text visible in the image"
        }
      },
      "required": ["description"]
    }
    ```
  </Step>

  <Step title="Create and run the Agent">
    Click **Create**, then run a job with an image URL to analyze.
  </Step>
</Steps>
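
Because only `description` is listed under `required` in the schema above, a result may omit `objects` and `text_content` under standard JSON Schema semantics. A minimal sketch of reading such a result defensively follows; `summarize_result` and any sample values passed to it are hypothetical.

```python
def summarize_result(result: dict) -> str:
    """Build a one-line summary from an image extraction result."""
    description = result["description"]            # required by the schema
    objects = result.get("objects", [])            # optional, may be absent
    text_content = result.get("text_content", "")  # optional, may be absent

    parts = [description]
    if objects:
        parts.append("Objects: " + ", ".join(objects))
    if text_content:
        parts.append("Visible text: " + text_content)
    return " | ".join(parts)
```

Using `dict.get` with a default keeps the consumer from failing when an optional field is missing.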

## Combined Text and Image Analysis

To analyze text and images together, fill in both **text** and **images** in the same configuration. The engine processes both modalities together and returns a single JSON value that matches **output\_schema**.
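
A combined configuration might look like the sketch below, written as a Python dictionary for illustration. The parameter names and the model value come from the Engine Inputs section; the instruction text and the schema fields (`matches`, `discrepancies`) are hypothetical examples, not values taken from the product.

```python
import json

combined_config = {
    "text": "$text",        # template string filled from the job's text input
    "images": "$images",    # template string filled from the job's image input
    "instruction": "Compare the written description with the attached images",
    "model": "gpt-4.1-2025-04-14",
    "output_schema": {
        "type": "object",
        "properties": {
            "matches": {
                "type": "boolean",
                "description": "Whether the images agree with the text",
            },
            "discrepancies": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Differences found between text and images",
            },
        },
        "required": ["matches"],
    },
}

# Print the configuration as JSON, for example to copy the schema into the form.
print(json.dumps(combined_config, indent=2))
```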
