Document Parser

Document Parser Engine Inputs

Document Parser Configuration

The Document Parser Engine configuration takes three parameters:

pdf_file: required. The PDF document to parse.
page_filter: optional. A string describing which PDF pages to consider during parsing. If omitted, all pages are considered.
features: required. Defines what you want to parse from your PDFs. Options include:
- pages -> Markdown
- tables -> CSV-like format
- charts -> CSV-like format
- logos -> array of strings identifying the logos
- photos -> text description

See Template Strings for dynamic parameter configuration.
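
The parameter rules above can be sketched as a small check. This is a minimal illustration, assuming the configuration is represented as a Python dictionary. The function name validate_config and the dictionary representation are hypothetical and not part of the product; only the parameter names and the five feature names come from this page.

```python
# Hypothetical sketch: the dict layout and validate_config are illustrative only.
# The feature names are the ones listed in this document.
ALLOWED_FEATURES = {"pages", "tables", "charts", "logos", "photos"}


def validate_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks valid."""
    problems = []
    # pdf_file is required.
    if not config.get("pdf_file"):
        problems.append("pdf_file is required")
    # features is required and must contain at least one known feature.
    features = config.get("features") or []
    if not features:
        problems.append("features is required")
    unknown = sorted(set(features) - ALLOWED_FEATURES)
    if unknown:
        problems.append("unknown features: " + ", ".join(unknown))
    # page_filter is optional, so it is not checked here.
    return problems


ok = validate_config({"pdf_file": "$pdf_file", "features": ["pages", "logos"]})
bad = validate_config({"features": ["maps"]})
```

Here ok is an empty list, while bad reports the missing pdf_file and the unrecognized feature name.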

Document Parser Output

The output is always a JSON value containing the features selected in the Engine configuration.

Document Parser Example

Let’s walk through an example that uses this engine.

Create an Agent

Click on the “Add Agent” button in the top right corner of the Agents page.

Enter a name and an optional description of your Agent.

Select the Document Parser Engine

Configure the engine as follows:

A leading $ marks a template string, which is filled in from the Agent inputs when a job runs.

pdf_file: $pdf_file
page_filter: Remove $page_filter and leave this field empty
features: Check all features
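
To illustrate how a template string such as $pdf_file turns into a concrete value at run time, here is a rough analogy using Python's standard string.Template, which happens to use the same $name notation. This is not the platform's implementation, and the resolve_config helper and the example.pdf value are hypothetical; see Template Strings for the actual rules.

```python
from string import Template


def resolve_config(config: dict, inputs: dict) -> dict:
    """Fill $name placeholders in string values from the job inputs."""
    resolved = {}
    for key, value in config.items():
        if isinstance(value, str):
            # Template.substitute replaces $name with inputs["name"].
            resolved[key] = Template(value).substitute(inputs)
        else:
            # Non-string values (such as the feature list) pass through unchanged.
            resolved[key] = value
    return resolved


# Mirrors the configuration in this example: page_filter is left empty.
config = {
    "pdf_file": "$pdf_file",
    "page_filter": "",
    "features": ["pages", "tables", "charts", "logos", "photos"],
}
resolved = resolve_config(config, {"pdf_file": "example.pdf"})
```

Because pdf_file is the only template string left in the configuration, it is the only input the job asks for later in this example.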

Your configuration should look like the image below.

Create the Agent

Hit the Create button. Now, let’s run it on a PDF file through the UI.

View the Agent you just created

Create a new Agent job

Download the PDF from the default Roe Datasets

Click on Resources in the sidebar and then Roe Datasets.

Click on View dataset for the pdf-parsing-example dataset.

Click on Download to download the file.

Fill in the Agent inputs

pdf_file: Upload the file you just downloaded.

Run the job

Hit the Create button at the bottom to start the PDF parsing job.

View the Results

Click View next to the job to see its status and results.

Scroll down the Agent Job Details page and you’ll see the job outputs.

For each page, the output contains every feature selected in the Engine configuration. For example, because logos was selected, the JSON output for the first page includes the “Pinterest” logo, which appears on the first page of the PDF.
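
If you process the job output programmatically, a per-page lookup might look like the sketch below. The exact JSON schema is not documented on this page, so the shape used here (a list of page objects with "page" and "logos" keys) is an assumption made purely for illustration, as is the logos_by_page helper. Check the real job output before relying on any key names.

```python
import json

# Hypothetical output shape, assumed for illustration only.
# The "Pinterest" logo on page 1 matches the example described above.
sample_output = json.loads("""
[
  {"page": 1, "logos": ["Pinterest"]},
  {"page": 2, "logos": []}
]
""")


def logos_by_page(pages: list) -> dict:
    """Map each page number to the list of logo names found on that page."""
    return {entry["page"]: entry.get("logos", []) for entry in pages}


found = logos_by_page(sample_output)
```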
