PDF Parser Engine Inputs

PDF Parser Configuration

The PDF Parser Engine configuration takes three parameters:

  • pdf_file: required. The PDF document to parse.
  • page_filter: optional. A string describing which PDF pages to consider during parsing. If omitted, all pages are considered.
  • features: required. Defines what you want to parse from your PDFs. Options include:
    • pages -> Markdown
    • tables -> CSV-like format
    • charts -> CSV-like format
    • logos -> array of strings identifying the logos
    • photos -> text description

See Template Strings for dynamic parameter configuration.
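
As a rough illustration of the three parameters, the sketch below models a configuration as a plain Python dictionary and checks it against the rules listed above. The `validate_config` helper and the dictionary layout are hypothetical, chosen only for illustration; they are not part of the engine's API.

```python
# Hypothetical helper: models the three parameters described above.
# The dict layout is an assumption for illustration, not the engine's API.
ALLOWED_FEATURES = {"pages", "tables", "charts", "logos", "photos"}


def validate_config(config: dict) -> dict:
    """Return a normalized copy of config, or raise ValueError."""
    pdf_file = config.get("pdf_file")
    if not pdf_file:
        raise ValueError("pdf_file is required")

    features = list(config.get("features") or [])
    if not features:
        raise ValueError("features is required")

    unknown = sorted(set(features) - ALLOWED_FEATURES)
    if unknown:
        raise ValueError(f"unknown features: {unknown}")

    # page_filter is optional; None stands for "consider all pages".
    page_filter = config.get("page_filter") or None
    return {"pdf_file": pdf_file, "page_filter": page_filter, "features": features}


config = validate_config({"pdf_file": "$pdf_file", "features": ["pages", "logos"]})
```

Leaving out `page_filter` here mirrors the documented default of considering every page.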

PDF Parser Output

The output is always a JSON value containing the features selected in the Engine configuration.
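
Since tables and charts come back in a CSV-like format, one way to work with them downstream is Python's standard `csv` module. This is a minimal sketch that assumes the value is a comma-separated string with one row per line; the sample string is made up, and the engine's actual formatting may differ.

```python
import csv
import io


def parse_csv_like(text: str) -> list[list[str]]:
    """Split a CSV-like string into rows of cells, skipping blank lines."""
    reader = csv.reader(io.StringIO(text.strip()))
    return [[cell.strip() for cell in row] for row in reader if row]


# Made-up sample value; real output from the engine may look different.
sample_table = "Quarter,Revenue\nQ1,100\nQ2,120\n"
rows = parse_csv_like(sample_table)
header, data = rows[0], rows[1:]
```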

PDF Parser Example

Let’s walk through an example using this engine.

1

Create an Agent

Click on the “Add Agent” button in the top right corner of the Agents page.

Enter a name and an optional description of your Agent.

2

Select the PDF Parser Engine

3

Configure the engine as follows

Note: a leading $ starts a template string.
  • pdf_file: $pdf_file

  • page_filter: Remove the $page_filter value and leave this field empty

  • features: Check all features
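
To illustrate the idea of a `$` placeholder that is filled in when a job runs, here is a sketch using Python's standard `string.Template`, which happens to use the same `$name` syntax. It is an analogy only: the platform's template strings may be resolved differently, and the file name is hypothetical.

```python
from string import Template

# Engine configuration as entered in this step: pdf_file is a placeholder,
# page_filter is left empty.
engine_config = {"pdf_file": "$pdf_file", "page_filter": ""}

# Inputs supplied when a job is created (hypothetical file name).
job_inputs = {"pdf_file": "example.pdf"}

# Replace each $name placeholder with the matching job input.
resolved = {
    key: Template(value).substitute(job_inputs)
    for key, value in engine_config.items()
}
```

Because `pdf_file` is a template string, the same Agent can be reused across jobs with a different file each time.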

Your configuration should look like the image below.

4

Create the Agent

Hit the Create button. Now, let’s run it on a PDF file through the UI.

5

View the Agent you just created

6

Create a new Agent job

7

Download the PDF from the default Roe Datasets

Click on Resources in the sidebar and then Roe Datasets.

Click on View dataset for the pdf-parsing-example dataset.

Click on Download to download the file.

8

Fill in the Agent inputs

pdf_file: Upload the file you just downloaded.

9

Run the job

Hit the Create button at the bottom to start the PDF parsing job.

10

View the Results

Click View next to the job to see its status and results.

Scroll down the Agent Job Details page and you’ll see the job outputs.

For each page, the output contains every feature selected in the Engine configuration. For example, because logos were selected, the JSON output for the first page includes the “Pinterest” logo, which appears on that page of the PDF.
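
If you copy the job output into a script, the per-page results can be read with Python's `json` module. The sketch below assumes a list of page objects with `page` and `logos` keys; that shape and the sample data are assumptions for illustration, so check the actual output on the Agent Job Details page before relying on them.

```python
import json

# Made-up output; the key names and nesting are assumptions.
raw_output = """
[
  {"page": 1, "logos": ["Pinterest"]},
  {"page": 2, "logos": []}
]
"""


def logos_by_page(raw: str) -> dict[int, list[str]]:
    """Map each page number to the list of logos reported for it."""
    pages = json.loads(raw)
    return {entry["page"]: entry.get("logos", []) for entry in pages}


result = logos_by_page(raw_output)
```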