Document Insights Engine Overview

The Document Insights Engine uses AI to extract structured information, insights, and data from PDF documents. It processes documents in batches, supports schema-based extraction and source tracing, and handles multi-page documents efficiently.

Engine Inputs

The Document Insights Engine Configuration has the following parameters:
  • instructions: required. Instructions for the AI describing what to extract from the PDF. The alias instruction is also accepted.
  • pdf_files: required. PDF files to extract insights from. Each can be a file upload, a URL, or a file ID. Multiple files can be given, separated by commas or newlines. The alias pdf_file is also accepted.
  • batch_size: optional. Number of PDF pages to process in each batch (default: 10, range: 1-50). Larger batches are faster but may reduce extraction quality.
  • model: optional. The AI model to use (default: gpt-4.1-2025-04-14).
  • temperature: optional. Controls randomness in output (default: 0.0). Range: 0.0 (deterministic) to 1.0 (most random).
  • use_source: optional. Trace the sources of extracted data (default: False). Adds cost when enabled.
  • convert_to_images: optional. Convert PDF pages to images and send to the model along with text (default: True). Improves extraction for visual documents.
  • output_schema: optional. JSON schema defining the structure of the data to extract. Follows the standard JSON Schema specification.
See Template Strings for dynamic parameter configuration.
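
The parameter rules above can be sketched as a small validation helper. This is only an illustration, assuming the configuration is held in a plain Python dictionary; normalize_config and split_pdf_files are hypothetical names, not part of the engine's API. The defaults, aliases, and ranges are the ones listed above.

```python
import re

# Defaults taken from the parameter list above.
DEFAULTS = {
    "batch_size": 10,
    "model": "gpt-4.1-2025-04-14",
    "temperature": 0.0,
    "use_source": False,
    "convert_to_images": True,
    "output_schema": None,
}

# Documented aliases: singular form -> canonical name.
ALIASES = {"instruction": "instructions", "pdf_file": "pdf_files"}


def split_pdf_files(value):
    """Split a comma- or newline-separated string into a list of entries."""
    if isinstance(value, (list, tuple)):
        return [str(v).strip() for v in value if str(v).strip()]
    return [part.strip() for part in re.split(r"[,\n]", value) if part.strip()]


def normalize_config(raw):
    """Hypothetical helper: apply aliases and defaults, then check ranges."""
    config = dict(DEFAULTS)
    for key, value in raw.items():
        config[ALIASES.get(key, key)] = value

    for required in ("instructions", "pdf_files"):
        if not config.get(required):
            raise ValueError(f"{required} is required")

    config["pdf_files"] = split_pdf_files(config["pdf_files"])

    if not 1 <= config["batch_size"] <= 50:
        raise ValueError("batch_size must be between 1 and 50")
    if not 0.0 <= config["temperature"] <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")
    return config


# Placeholder URLs, for illustration only.
config = normalize_config({
    "instruction": "Extract all rows from the ASSETS table",
    "pdf_file": "https://example.com/a.pdf, https://example.com/b.pdf",
    "batch_size": 20,
})
```

Here the aliases are mapped to instructions and pdf_files, the two URLs are split into a list, and the unspecified parameters fall back to their documented defaults. Splitting on commas is a simplification: a URL that itself contains a comma would be split incorrectly.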

Engine Output

If an output_schema is defined, the output is a JSON value matching the structure it specifies. Without a schema, the engine returns the extracted content in a default format.

Example Usage

1. Create an Agent

Click the “Add Agent” button in the top-right corner of the Agents page.
Enter a name and, optionally, a description for your Agent.
2. Select the Document Insights Engine

3. Configure the engine

A value that starts with $ is a template string. Here, $instructions and $pdf_files are filled in when a job is run.
  • instructions: $instructions
  • pdf_files: $pdf_files
  • batch_size: 10
  • model: gpt-4.1-2025-04-14
  • temperature: 0.0
  • convert_to_images: True
  • output_schema: Copy and paste the JSON schema below:
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "row_name": {
        "type": "string",
        "description": "The name of a row"
      },
      "value_1": {
        "type": "string",
        "description": "The monetary value for the first period"
      },
      "value_2": {
        "type": "string",
        "description": "The monetary value for the second period"
      },
      "type": {
        "type": "string",
        "description": "Asset type classification"
      }
    },
    "description": "A row in the financial statement"
  },
  "description": "All entries from the financial statement table"
}
You can click Use Widget to view the JSON schema in the UI.
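
To make the expected output shape concrete, the sketch below checks a result against the property types declared in the schema above. It is a deliberately small stand-in for a full JSON Schema validator, find_problems is a hypothetical helper, and the sample rows are invented for illustration; they are not output from the engine.

```python
import json

# The output_schema from above, trimmed to the parts this check uses.
SCHEMA = json.loads("""
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "row_name": {"type": "string"},
      "value_1": {"type": "string"},
      "value_2": {"type": "string"},
      "type": {"type": "string"}
    }
  }
}
""")


def find_problems(result, schema):
    """Return a list of human-readable mismatches (empty if none found).

    Only handles an array of flat objects with string properties, which is
    all the schema above needs; it is not a general JSON Schema validator.
    """
    if not isinstance(result, list):
        return ["top-level value is not an array"]
    properties = schema["items"]["properties"]
    problems = []
    for index, row in enumerate(result):
        if not isinstance(row, dict):
            problems.append(f"item {index} is not an object")
            continue
        for name in properties:
            # The schema lists no required properties, so absent ones are fine.
            if name in row and not isinstance(row[name], str):
                problems.append(f"item {index}: {name!r} is not a string")
    return problems


# Invented sample rows, for illustration only.
sample = [
    {"row_name": "Cash and cash equivalents", "value_1": "1,200",
     "value_2": "950", "type": "current"},
    {"row_name": "Property and equipment", "value_1": 3400,
     "value_2": "3,100", "type": "non-current"},
]

problems = find_problems(sample, SCHEMA)
print(problems)
```

The second sample row is flagged because its value_1 is a number, while the schema declares every property as a string. Keeping monetary values as strings preserves the formatting found in the document, such as thousands separators.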
4. Create the Agent

Click the Create button.
5. Run a job

Create a new job and provide:
  • instructions: “Extract all rows from the ASSETS table in the balance sheet”
  • pdf_files: Upload or provide a URL to a PDF document
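
Conceptually, the values provided for a job replace the matching $ placeholders in the engine configuration. The sketch below imitates that with Python's string.Template, which happens to use the same $name syntax. How the platform actually resolves template strings is covered in Template Strings and may differ; resolve is a hypothetical helper and the file URL is a placeholder.

```python
from string import Template

# Engine configuration from step 3; $-prefixed values are placeholders.
engine_config = {
    "instructions": "$instructions",
    "pdf_files": "$pdf_files",
    "batch_size": 10,
}

# Inputs supplied when the job is created (the URL is a placeholder).
job_inputs = {
    "instructions": "Extract all rows from the ASSETS table in the balance sheet",
    "pdf_files": "https://example.com/balance-sheet.pdf",
}


def resolve(config, inputs):
    """Fill $name placeholders in string values; leave other values as-is."""
    resolved = {}
    for key, value in config.items():
        if isinstance(value, str):
            # substitute() raises KeyError if a placeholder has no input.
            resolved[key] = Template(value).substitute(inputs)
        else:
            resolved[key] = value
    return resolved


resolved = resolve(engine_config, job_inputs)
```

After this call, resolved holds the job's instructions and PDF URL in place of the placeholders, while fixed values such as batch_size pass through unchanged.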
6. View the Results

Click View on the job to see its status and results.
Scroll down to see the extracted data matching your output schema.