Text Extraction Engine Inputs

Text Extraction Configuration

The Text Extraction Engine configuration takes four parameters:

  • instruction: optional. A string used to prompt the Agent during job execution.
  • text: required. The text input to extract from.
  • model: required. The model to use for extraction.
  • output_schema: optional. Defines the exact structure of the JSON output that the extracted data will populate. Follows the standard JSON schema specification.

See Template Strings for dynamic parameter configuration.
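As a rough illustration of the required/optional split above, the configuration can be pictured as a plain Python dict. This is only a sketch: the dict layout and the `validate_config` helper are hypothetical and are not part of the product's API. Only the four parameter names come from the list above.

```python
# Hypothetical picture of the engine configuration as a plain dict.
# Only the parameter names are taken from the documentation.
REQUIRED = ("text", "model")
OPTIONAL = ("instruction", "output_schema")


def validate_config(config):
    """Return a list of problems; an empty list means the config looks complete."""
    problems = []
    for name in REQUIRED:
        if not config.get(name):
            problems.append(f"missing required parameter: {name}")
    for name in config:
        if name not in REQUIRED + OPTIONAL:
            problems.append(f"unknown parameter: {name}")
    return problems


config = {"text": "$text", "model": "gpt-4o", "instruction": "$instruction"}
print(validate_config(config))                # []
print(validate_config({"model": "gpt-4o"}))   # ['missing required parameter: text']
```

Leaving out `instruction` or `output_schema` is accepted here because both are optional; leaving out `text` or `model` is reported.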

Text Extraction Output

When output_schema is defined, the output is always a JSON value with the structure that schema specifies.

Text Extraction Example

Let’s run through an example using this engine together.

1

Create an Agent

Click on the “Add Agent” button in the top right corner of the Agents page.

Enter a name and an optional description of your Agent.

2

Select the Text Extraction Engine

3

Remove the model input from Agent Input Definition

Remove the model Agent input

4

Configure the engine as follows

A leading $ starts a template string, so $instruction and $text below are filled in from the Agent inputs.

  • instruction: $instruction

  • text: $text

  • model: gpt-4o

  • output_schema: Paste the JSON schema below (click Use Text), or define the same schema with the UI widget as shown in the image below.

{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "Title of the text"
    },
    "author": {
      "type": "string",
      "description": "Author of the text"
    },
    "content": {
      "type": "string",
      "description": "Summary of the content of the text"
    }
  },
  "description": "important information about the text"
}

Defining output_schema using the UI Widget
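To make the role of the $ template strings in this configuration more concrete, here is a minimal sketch of how values such as $instruction and $text could be replaced by Agent inputs at job time. It relies on Python's standard `string.Template`, which happens to use the same `$name` syntax. That is an analogy only: the `resolve` helper and the sample inputs are hypothetical, and the engine's actual resolution mechanism is described under Template Strings.

```python
from string import Template

# Engine configuration from this step; "$..." values are template strings.
config = {
    "instruction": "$instruction",
    "text": "$text",
    "model": "gpt-4o",
}

# Hypothetical Agent inputs supplied when a job is created.
agent_inputs = {
    "instruction": "Extract the title, author, and a short summary.",
    "text": "Title: Example\nAuthor: GPT\n\nBody of the document.",
}


def resolve(config, inputs):
    """Substitute Agent inputs into every string-valued parameter."""
    return {
        key: Template(value).substitute(inputs) if isinstance(value, str) else value
        for key, value in config.items()
    }


resolved = resolve(config, agent_inputs)
print(resolved["model"])        # gpt-4o
print(resolved["instruction"])  # Extract the title, author, and a short summary.
```

Because model is set to a fixed value rather than a template string, it passes through unchanged and needs no Agent input.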

5

Create the Agent

Hit the Create button. Now, let’s run it on text input through the UI.

6

View the Agent you just created

7

Create a new Agent job

8

Fill in the Agent inputs

Paste in the following text for the text input field:

Title: The Role of AI in Managing Unstructured Data in Modern Data Warehouses
Author: GPT

In the rapidly evolving field of artificial intelligence, data plays a pivotal role. The ability to extract, classify, and retrieve information from diverse data sources such as documents, webpages, videos, images, and audio is crucial for developing intelligent systems. Advanced AI models, like those developed by Roe AI, enable seamless integration and utilization of unstructured data within data warehouses.

Data warehouses traditionally handle structured data, but the growing volume of unstructured data requires more sophisticated solutions. By leveraging AI-powered SQL, Roe AI provides tools that not only store but also process and analyze unstructured data. This technology allows users to perform complex queries, automate data classification, and enhance retrieval-augmented generation (RAG) processes.

The implications of such technology are vast, impacting various industries from healthcare to finance. For example, in healthcare, AI can extract relevant patient information from medical records, aiding in faster diagnosis and personalized treatment plans. In finance, AI-driven data extraction can streamline regulatory compliance and fraud detection by analyzing large volumes of transactions and communications.

In conclusion, the integration of AI with data warehouses signifies a major advancement in data management and utilization. Companies that adopt these technologies can unlock valuable insights from their unstructured data, driving innovation and efficiency in their operations.

Here are the filled-in Agent inputs:

You may need to experiment with the output_schema configuration and with the prompt you pass as the instruction to get the results you want.

9

Run the job

Hit the Create button at the bottom to start the text extraction job.

10

View the Results

Click View on the job to see its status and results.

Scroll down the Agent Job Details page and you’ll see the job outputs.

Notice that the JSON output follows the structure you defined in the output_schema. In this example, the output is a JSON object whose properties are filled in by the Agent.
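If you consume job outputs in your own code, a quick structural check against the schema can catch surprises early. The sketch below uses only the Python standard library. The `sample_output` value is a hand-written stand-in built from the title and author lines of the example text, not a real engine result, and the `mismatches` helper is hypothetical. For full JSON Schema validation, a dedicated validator library would be more thorough.

```python
import json

# The output_schema from step 4, trimmed to the parts this check uses.
schema = json.loads("""
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "author": {"type": "string"},
    "content": {"type": "string"}
  }
}
""")

# Hand-written stand-in for a job output, NOT a real engine result.
sample_output = json.loads("""
{
  "title": "The Role of AI in Managing Unstructured Data in Modern Data Warehouses",
  "author": "GPT",
  "content": "<summary text produced by the model>"
}
""")

# Minimal mapping from JSON Schema type names to Python types.
JSON_TYPES = {
    "string": str,
    "object": dict,
    "array": list,
    "boolean": bool,
    "integer": int,
    "number": (int, float),
}


def mismatches(output, schema):
    """List properties that are absent or whose type differs from the schema."""
    if not isinstance(output, dict):
        return ["output is not a JSON object"]
    found = []
    for name, spec in schema.get("properties", {}).items():
        expected = JSON_TYPES.get(spec.get("type"), object)
        if name not in output:
            # The schema has no "required" list, so this is a convenience check.
            found.append(f"{name}: missing")
        elif not isinstance(output[name], expected):
            found.append(f"{name}: expected {spec['type']}")
    return found


print(mismatches(sample_output, schema))  # []
```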