Document Segmentation Engine Overview

The Document Segmentation Engine analyzes PDF documents and extracts specific page ranges based on semantic descriptions, explicit page ranges, or table of contents entries. It supports multi-stage filtering with optimized processing for large documents.

Engine Inputs

The Document Segmentation Engine Configuration has the following parameters:
  • page_description: required. A description of the pages to select. Supports multiple formats:
    • Semantic search: Natural language description of target pages (e.g., “Any page containing financial data”)
    • @PAGERANGE prefix: Explicit page ranges (e.g., @PAGERANGE: 3, 5-15)
    • @TOC prefix: Table of contents entries (e.g., @TOC: Chapter 1, Section 2)
    • Combined: multiple formats can be used together in one query, as in the example below, where clauses are separated by semicolons
  • pdf_file: required. The PDF file to select pages from. Accepts a file upload, a URL, or a file ID.
  • model: optional. The AI model to use (defaults to the tier 1 image model).
Example page_description:
@TOC: Chapter 1, Section 2, some-relevant-text;
@PAGERANGE: 3, 5-15;
Any page containing financial statements
See Template Strings for dynamic parameter configuration.
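
To make the combined format concrete, the sketch below splits a page_description like the example above into its three kinds of clauses. It illustrates the syntax only and is not the engine's actual parser: the function name split_page_description, the returned dictionary layout, and the assumption that clauses are separated by semicolons (inferred from the example) are all hypothetical.

```python
import re


def split_page_description(description: str) -> dict:
    """Split a combined page_description into TOC, page-range, and semantic clauses.

    Hypothetical helper: assumes clauses are separated by ';' as in the example.
    """
    parts = {"toc": [], "page_ranges": [], "semantic": []}
    for clause in description.split(";"):
        clause = clause.strip()
        if not clause:
            continue
        if clause.startswith("@TOC:"):
            # Comma-separated table of contents entries.
            entries = clause[len("@TOC:"):].split(",")
            parts["toc"].extend(e.strip() for e in entries if e.strip())
        elif clause.startswith("@PAGERANGE:"):
            # Comma-separated single pages ("3") or ranges ("5-15").
            for token in clause[len("@PAGERANGE:"):].split(","):
                token = token.strip()
                if not token:
                    continue
                match = re.fullmatch(r"(\d+)(?:\s*-\s*(\d+))?", token)
                if match is None:
                    raise ValueError(f"Invalid page range token: {token!r}")
                start = int(match.group(1))
                end = int(match.group(2) or start)
                parts["page_ranges"].append((start, end))
        else:
            # Anything without a prefix is treated as a semantic description.
            parts["semantic"].append(clause)
    return parts


description = (
    "@TOC: Chapter 1, Section 2, some-relevant-text;\n"
    "@PAGERANGE: 3, 5-15;\n"
    "Any page containing financial statements"
)
print(split_page_description(description))
```

Note that this sketch would also split a semantic description that itself contains a semicolon; whether the engine behaves the same way is not stated here.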

Engine Output

The output is a list of page ranges matching the specified criteria.
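
The exact shape of the output is not specified beyond "a list of page ranges". As a rough sketch, assuming each range arrives as an inclusive (start, end) pair of 1-based page numbers, a consumer could flatten the result into individual pages like this. The function name expand_page_ranges and the tuple representation are assumptions for illustration only.

```python
def expand_page_ranges(ranges):
    """Flatten inclusive (start, end) page ranges into a sorted list of unique pages.

    Assumed input shape: a list of (start, end) tuples with start <= end.
    """
    pages = set()
    for start, end in ranges:
        if start > end:
            raise ValueError(f"Range start {start} is after end {end}")
        pages.update(range(start, end + 1))
    return sorted(pages)


# Overlapping ranges are merged because pages are collected in a set.
print(expand_page_ranges([(3, 3), (5, 7), (6, 9)]))
```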